Scanning and Sequential Decision Making for
Multi-Dimensional Data - Part II: the Noisy Case*

Asaf Cohen, Tsachy Weissman and Neri Merhav
February 5, 2008 

Abstract 

We consider the problem of sequential decision making on random fields corrupted by noise. 
In this scenario, the decision maker observes a noisy version of the data, yet is judged with respect
to the clean data. In particular, we first consider the problem of sequentially scanning and 
filtering noisy random fields. In this case, the sequential filter is given the freedom to choose 
the path over which it traverses the random field (e.g., noisy image or video sequence), thus it 
is natural to ask what is the best achievable performance and how sensitive this performance 
is to the choice of the scan. We formally define the problem of scanning and filtering, derive 
a bound on the best achievable performance and quantify the excess loss occurring when 
non-optimal scanners are used, compared to optimal scanning and filtering. 

We then discuss the problem of sequential scanning and prediction of noisy random fields. 
This setting is a natural model for applications such as restoration and coding of noisy images.
We formally define the problem of scanning and prediction of a noisy multidimensional
array and relate the optimal performance to the clean scandictability defined by Merhav and 
Weissman. Moreover, bounds on the excess loss due to sub-optimal scans are derived, and a 
universal prediction algorithm is suggested. 

This paper is the second part of a two-part paper. The first paper dealt with sequential
decision making on noiseless data arrays, namely, when the decision maker is judged with 
respect to the same data array it observes. 

1 Introduction 

Consider the problem of sequentially scanning and filtering (or predicting) a multidimensional 
noisy data array, while minimizing a given loss function. Particularly, at each time instant 
$t$, $1 \le t \le |B|$, where $|B|$ is the number of sites ("pixels") in the data array, the sequential
decision maker chooses a site to be visited, denoted by $\Psi_t$. In the filtering scenario, it first
observes the value at that site, and then gives an estimate of the underlying clean value.
In the prediction scenario, it is required to give a prediction for that (clean) value, before
the actual observation is made. In both cases, both the location and the estimation or
prediction may depend on the previously observed values - the values at sites $\Psi_1$ to $\Psi_{t-1}$. The
goal is to minimize the cumulative loss after scanning the entire data array. 

Applications of this problem can be found in image and video processing, such as filtering 
or predictive coding. In these applications, one wishes to either enhance or jointly enhance 
and code a given image. The motivation behind a prediction/compression-based approach, 
is that the prediction error may consist mainly of the noise signal, while the clean signal is 
recovered by the predictor. For example, see [1]. It is clear that different scanning patterns of 
the image may result in different filtering or prediction errors, thus, it is natural to ask what 
is the performance of the optimal scanning strategy, and what is the loss when non-optimal 
strategies are used. 

The problem of scanning multidimensional data arrays also arises in other areas of image 
processing, such as one-dimensional wavelet [2] or median [3] processing of images, where 
one seeks a space-filling curve which facilitates the one-dimensional signal processing of the 
multidimensional data. Other examples include digital halftoning [4], where a space filling 
curve is sought in order to minimize the effect of false contours, and pattern recognition 
[5]. Yet more applications can be found in multidimensional data query [6] and indexing [7], 
where multidimensional data is stored on a one-dimensional storage device, hence a locality- 
preserving space-filling curve is sought in order to minimize the number of continuous read 
operations required to access a multidimensional object, and rendering of three-dimensional 
graphics [1], [9]. 

An information theoretic discussion of the scanning problem was initiated by Lempel and
Ziv in [10], where the Peano-Hilbert scan was shown to be optimal for compression of individual
images. In [11], Merhav and Weissman formally defined a "scandictor", a scheme for sequentially
scanning and predicting a multidimensional data array, as well as the "scandictability"
of a random field, namely, the best achievable performance for scanning and prediction of a
random field. Particular cases where this value can be computed and the optimal scanning
order can be identified were discussed in that work. One of the main results of [11] is the
fact that if a stochastic field can be represented autoregressively (under a specific scan $\Psi$)
with a maximum-entropy innovation process, then it is optimally scandicted in the way it was
created (i.e., by the specific scan $\Psi$ and its corresponding optimal predictor). A more comprehensive
survey can be found in [12] and [13]. In [12], the problem of universal scanning and
prediction of noise-free multidimensional arrays was investigated. Although this problem is 
fundamentally different from its one-dimensional analogue (for example, one cannot compete 
successfully with any two scandictors on any individual image), a universal scanning and pre- 
diction algorithm which achieves the scandictability of any stationary random field was given, 
and the excess loss incurred when non-optimal scanning strategies are used was quantified. 

In [14], Weissman, Merhav and Somekh-Baruch, as well as Weissman and Merhav in [15]
and [16], extended the problem of universal prediction to the case of a noisy environment.
Namely, the predictor observes a noisy version of the sequence, yet, it is judged with respect
to the clean sequence. In this paper, we extend the results of [11] and [12] to this noisy scenario.
We formally define the problem of sequentially filtering or predicting a multidimensional data 
array. First, we derive lower bounds on the best achievable performance. We then discuss 
the scenario where non-optimal scanning strategies are used. That is, we assume that, due to 
implementation constraints, for example, one cannot use the optimal scanner for a given data 
array, and is forced to use an arbitrary scanning order. In such a scenario, it is important to 
understand what is the excess loss incurred, compared to optimal scanning and filtering (or 
prediction). We derive upper bounds on this excess loss. Finally, we briefly mention how the 
results of [12] can be exploited in order to construct universal schemes for the noisy case as
well. While many of the results for noisy scandiction can be extended from the noiseless case,
just as results for noisy prediction were extended from results for noiseless prediction [15],
the scanning and filtering problem poses new challenges and requires the use of new tools and
techniques.

The paper is organized as follows. Section 2 includes a precise formulation of the problem.
Section 3 includes the results on scanning and filtering of noisy data arrays, while Section
4 is devoted to the prediction scenario. In both sections, particular emphasis is given to
the important cases of Gaussian random fields corrupted by Additive White Gaussian Noise
(AWGN), under the squared error criterion, and binary random fields corrupted by a Binary
Symmetric Channel (BSC), under the Hamming loss criterion.

In particular, in Section 3.1, a new tool is used to derive a lower bound on the optimum
scanning and filtering performance (Section 4.1 later shows how this tool can be used to
strengthen the results of [11] in the noise-free scenario as well). Section 3.2 gives upper bounds
on the excess loss in non-optimal scanning. In Section 3.2.1, the results of Duncan [17], as well as
those of Guo, Shamai and Verdú [18], are used to derive the bounds when the noise is Gaussian,
and Section 3.2.2 deals with the binary setting. Section 3.3 uses recent results by Weissman et
al. [19] to describe how universal scanning and filtering algorithms can be constructed. In the
noisy scandiction section, Section 4.1 relates the best achievable performance in this setting,
as well as the achieving scandictors, to the clean scandictability of the noisy field. Section
4.2 introduces a universal scandiction algorithm, and Section 4.3 gives an upper bound on
the excess loss. In both Section 3 and Section 4, the sub-sections describing the optimum
performance, the excess loss bounds and the universal algorithms are not directly related and
can be read independently. Finally, Section 5 contains some concluding remarks.

2 Problem Formulation 

We start with a formal definition of the problem. Let $A$ denote the alphabet, which is either
discrete or the real line. Let $N$ be the noisy observation alphabet. Let $\Omega = (A \times N)^{\mathbb{Z}^2}$ be
the observation space (the results can be extended to any finite dimension). A probability
measure $Q$ on $\Omega$ is stationary if it is invariant under translations $\tau_i$, where for each $\omega \in \Omega$ and
$i, j \in \mathbb{Z}^2$, $\tau_i(\omega)_j = \omega_{j+i}$ (namely, stationarity means shift invariance). Denote by $\mathcal{M}(\Omega)$ and
$\mathcal{M}_S(\Omega)$ the sets of all probability measures on $\Omega$, and stationary probability measures on $\Omega$,
respectively. Elements of $\mathcal{M}(\Omega)$, random fields, will be denoted by upper case letters while
elements of $\Omega$, individual data arrays, will be denoted by the corresponding lower case. It will
also be beneficial to refer to the clean and noisy random fields separately, that is, $\{X_t\}_{t \in \mathbb{Z}^2}$
represents the clean signal and $\{Y_t\}_{t \in \mathbb{Z}^2}$ represents the noisy observations, where for $t \in \mathbb{Z}^2$,
$X_t$ is the random variable corresponding to $X$ at site $t$.

Let $\mathcal{V}$ denote the set of all finite subsets of $\mathbb{Z}^2$. For $V \in \mathcal{V}$, denote by $X_V$ the restriction
of the data array $X$ to $V$. Let $\mathcal{R}_\square$ be the set of all rectangles of the form $V = \mathbb{Z}^2 \cap ([m_1, m_2] \times
[n_1, n_2])$. As a special case, denote by $V_n$ the square $\{0, \ldots, n-1\} \times \{0, \ldots, n-1\}$. For $V \subset \mathbb{Z}^2$,
let the interior radius of $V$ be

$$R(V) = \sup\{r : \exists c \text{ s.t. } B(c, r) \subseteq V\}, \qquad (1)$$

where $B(c, r)$ is a closed ball (under the $l_1$-norm) of radius $r$ centered at $c$. Throughout, $\ln(\cdot)$
will denote the natural logarithm.

Definition 1. A scanner-filter pair for a finite set of sites $B \in \mathcal{V}$ is the following pair $(\Psi, F)$:

• The scan $\{\Psi_t\}_{t=1}^{|B|}$ is a sequence of measurable mappings, $\Psi_t : N^{t-1} \to B$, determining the
site to be visited at time $t$, with the property that

$$\{\Psi_1, \Psi_2(y_{\Psi_1}), \Psi_3(y_{\Psi_1}, y_{\Psi_2}), \ldots, \Psi_{|B|}(y_{\Psi_1}, \ldots, y_{\Psi_{|B|-1}})\} = B, \quad \forall y \in N^B. \qquad (2)$$

• $\{F_t\}_{t=1}^{|B|}$ is a sequence of measurable filters, where for each $t$, $F_t : N^t \to D$ determines
the reconstruction for the value at the site visited at time $t$, based on the current and
previous observations, and $D$ is the reconstruction alphabet.

Note that both the scanner $\Psi$ and the filters $\{F_t\}$ base their decisions only on the noisy
observations. In the prediction scenario (i.e., noisy scandiction), we define $F_t : N^{t-1} \to D$,
that is, $\{F_t\}$ represents measurable predictors, which have access only to previous observations.
We allow randomized scanner-filter pairs, namely, pairs such that $\{\Psi_t\}_{t=1}^{|B|}$ or $\{F_t\}_{t=1}^{|B|}$ can be
chosen randomly from some set of possible functions. It is also important to note that we
consider only scanners for finite sets of sites, ones which can be viewed merely as a reordering
of the sites in a finite set $B$.

The cumulative loss of a scanner-filter pair $(\Psi, F)$ up to time $t \le |B|$ is denoted by
$L_{(\Psi,F)}(x_B, y_B)_t$,

$$L_{(\Psi,F)}(x_B, y_B)_t = \sum_{i=1}^{t} l(x_{\Psi_i}, F_i(y_{\Psi_1}, \ldots, y_{\Psi_i})), \qquad (3)$$

where $l : A \times D \to [0, \infty)$ is the loss function. The sum of the instantaneous losses over the
entire data array $B$, $L_{(\Psi,F)}(x_B, y_B)_{|B|}$, will be abbreviated as $L_{(\Psi,F)}(x_B, y_B)$.
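
Definition 1 and the cumulative loss in (3) can be made concrete with a minimal Python sketch. Everything named here is an illustrative assumption and not part of the paper: the 8 x 8 i.i.d. binary field, the raster scan, the "say-what-you-see" filter, and the helper names `bsc`, `raster_scan`, `say_what_you_see` and `cumulative_loss`.

```python
import random

def bsc(bit, delta, rng):
    """Flip `bit` with probability delta (binary symmetric channel)."""
    return bit ^ (rng.random() < delta)

def raster_scan(n):
    """A data-independent scan: visit the n x n sites row by row."""
    return [(r, c) for r in range(n) for c in range(n)]

def say_what_you_see(observed):
    """A simple filter F_t: output the most recent noisy observation."""
    return observed[-1]

def cumulative_loss(x, y, scan, filt):
    """Cumulative Hamming loss of (scan, filt), as in (3) with t = |B|."""
    observed, loss = [], 0
    for site in scan:
        observed.append(y[site])                 # the filter sees the current noisy value
        loss += int(filt(observed) != x[site])   # but is judged against the clean value
    return loss

rng = random.Random(0)
n, delta = 8, 0.25                               # hypothetical field size and crossover
x = {(r, c): rng.randint(0, 1) for r in range(n) for c in range(n)}
y = {site: bsc(bit, delta, rng) for site, bit in x.items()}
loss = cumulative_loss(x, y, raster_scan(n), say_what_you_see)
flips = sum(x[site] != y[site] for site in x)
print(loss, flips)
```

For this particular filter the loss equals the number of channel flips regardless of the scan; a filter that exploits earlier observations is where the choice of scan begins to matter.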

For a given loss function $l$ and a field $Q \in \mathcal{M}(\Omega)$ restricted to $B$, define the best achievable
scanning and filtering performance by

$$U(l, Q_B) = \inf_{(\Psi,F) \in \mathcal{S}(B)} E_{Q_B} \frac{1}{|B|} L_{(\Psi,F)}(X_B, Y_B), \qquad (4)$$

where $Q_B$ is the marginal probability measure restricted to $B$ and $\mathcal{S}(B)$ is the set of all possible
scanner-filter pairs for $B$. The best achievable performance for the field $Q$, $U(l, Q)$, is defined
by

$$U(l, Q) = \lim_{n \to \infty} U(l, Q_{V_n}), \qquad (5)$$

if this limit exists.

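In the prediction scenario, $F_t$ is allowed to base its estimation only on $y_{\Psi_1}, \ldots, y_{\Psi_{t-1}}$, and
we have

$$L_{(\Psi,F)}(x_B, y_B) = \sum_{t=1}^{|B|} l(x_{\Psi_t}, F_t(y_{\Psi_1}, \ldots, y_{\Psi_{t-1}})), \qquad (6)$$

$$\bar{U}(l, Q_B) = \inf_{(\Psi,F) \in \mathcal{S}(B)} E_{Q_B} \frac{1}{|B|} L_{(\Psi,F)}(X_B, Y_B), \qquad (7)$$

and

$$\bar{U}(l, Q) = \lim_{n \to \infty} \bar{U}(l, Q_{V_n}), \qquad (8)$$

if this limit exists.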

The following proposition asserts that for any stationary random field both the limit in (5)
and the limit in (8) exist.

Proposition 1. For any stationary field $Q \in \mathcal{M}_S(\Omega)$ and for any sequence $\{B_n\}$, $B_n \in \mathcal{R}_\square$,
satisfying $R(B_n) \to \infty$, the limits in (5) and (8) exist and satisfy

$$U(l, Q) = \lim_{n \to \infty} U(l, Q_{B_n}) = \inf_{B \in \mathcal{R}_\square} U(l, Q_B), \qquad (9)$$

$$\bar{U}(l, Q) = \lim_{n \to \infty} \bar{U}(l, Q_{B_n}) = \inf_{B \in \mathcal{R}_\square} \bar{U}(l, Q_B). \qquad (10)$$

Since $U(l, Q_B)$ and $\bar{U}(l, Q_B)$ possess the sub-additivity property, i.e., for any $V, V'$ with $V \cap
V' = \emptyset$, there exists a scanner-filter pair $(\Psi, F)$ (or a scandictor $(\Psi', F')$) on $V \cup V'$ such that

$$E_Q L_{(\Psi,F)}(X_{V \cup V'}, Y_{V \cup V'}) \le |V| U(l, Q_V) + |V'| U(l, Q_{V'}), \qquad (11)$$

the proof of Proposition 1 follows verbatim that of [11, Theorem 1].



3 Filtering of Noisy Data Arrays 

In this section, we consider the scenario of scanning and filtering. In this case, a lower bound on 
the best achievable performance is derived. For the cases of Gaussian random fields corrupted 
by AWGN and binary valued fields observed through a BSC, we derive bounds on the excess 
loss when a non-optimal scanner is used (with an optimal filter). Finally, we briefly discuss 
universal scanning and filtering. 




3.1 A Lower Bound on the Best Achievable Scanning and Filtering Performance

We assume an invertible memoryless channel, meaning the channel input distribution of a 
single symbol is uniquely determined given the output distribution. As an example, a discrete 
memoryless channel with an invertible channel matrix can be kept in mind. See [20] for
a discussion on the conditions on the channel matrix for the invertibility property to hold. 
Moreover, as will be elaborated on later, the result below applies to more general channels, 
including continuous ones. 

In the case of an invertible channel, we define the associated Bayes envelope by

$$\mu(P) = \min_{g(\cdot)} E\, l(X, g(Y)), \qquad (12)$$

where $P$ is the distribution of the channel output $Y$. Define

$$\zeta(d) = \max\{H(P) : \mu(P) \le d\}, \qquad (13)$$

and let $\bar{\zeta}(\cdot)$ be the upper concave ($\cap$) envelope of $\zeta(\cdot)$.

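Theorem 2. Let $Y_B$ be the output of an invertible memoryless channel whose input is $X_B$.
Then, for any scanner-filter pair $(\Psi, F)$ we have

$$\bar{\zeta}\left(E_{Q_B} \frac{1}{|B|} L_{(\Psi,F)}(X_B, Y_B)\right) \ge \frac{1}{|B|} H(Y_B), \qquad (14)$$

that is,

$$\bar{\zeta}(U(l, Q_B)) \ge \frac{1}{|B|} H(Y_B). \qquad (15)$$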

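Proof. We prove the above theorem for the discrete case. Yet, the derivations below apply to
the continuous case as well, with summations replaced by the appropriate integrals and the
entropy replaced by differential entropy.
Denote by $\Psi(Y_B)$ the reordered output sequence, that is, $\{Y_{\Psi_1}, Y_{\Psi_2}, \ldots, Y_{\Psi_{|B|}}\}$, and by $Y_{\Psi_1}^{\Psi_t}$ its
first $t$ entries. We have,

$$
\begin{aligned}
H(Y_B) &\overset{(a)}{=} H(\Psi(Y_B)) \\
&= \sum_{t=1}^{|B|} H(Y_{\Psi_t} | Y_{\Psi_1}^{\Psi_{t-1}}) \\
&= \sum_{t=1}^{|B|} \sum_{y_{\Psi_1}^{\Psi_{t-1}}} H(Y_{\Psi_t} | Y_{\Psi_1}^{\Psi_{t-1}} = y_{\Psi_1}^{\Psi_{t-1}}) P(y_{\Psi_1}^{\Psi_{t-1}}) \\
&\overset{(b)}{\le} \sum_{t=1}^{|B|} \sum_{y_{\Psi_1}^{\Psi_{t-1}}} \zeta\left(E\left[l(X_{\Psi_t}, F_t(Y_{\Psi_1}^{\Psi_t})) \,\middle|\, Y_{\Psi_1}^{\Psi_{t-1}} = y_{\Psi_1}^{\Psi_{t-1}}\right]\right) P(y_{\Psi_1}^{\Psi_{t-1}}) \\
&\overset{(c)}{\le} \sum_{t=1}^{|B|} \sum_{y_{\Psi_1}^{\Psi_{t-1}}} \bar{\zeta}\left(E\left[l(X_{\Psi_t}, F_t(Y_{\Psi_1}^{\Psi_t})) \,\middle|\, Y_{\Psi_1}^{\Psi_{t-1}} = y_{\Psi_1}^{\Psi_{t-1}}\right]\right) P(y_{\Psi_1}^{\Psi_{t-1}}) \\
&\overset{(d)}{\le} \sum_{t=1}^{|B|} \bar{\zeta}\left(E_{Q_B}\, l(X_{\Psi_t}, F_t(Y_{\Psi_1}^{\Psi_t}))\right) \\
&\overset{(e)}{\le} |B|\, \bar{\zeta}\left(E_{Q_B} \frac{1}{|B|} L_{(\Psi,F)}(X_B, Y_B)\right).
\end{aligned} \qquad (16)
$$

The equality (a) holds since the reordering does not change the entropy of $Y_B$. While this is clear
for data-independent reordering, more caution is required when $\Psi$ is a data-dependent scan.
Yet, this can be proved using the chain rule, and noting that conditioned on $Y_{\Psi_1}^{\Psi_{t-1}}$, the next
site is fixed (this is similar to the proof of [12, Proposition 13]). The inequalities (b) and
(c) follow from the definitions of $\zeta$ and $\bar{\zeta}$, respectively, and (d) and (e) follow from Jensen's
inequality. □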
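
Step (a) in (16) can be checked numerically on a toy example. The sketch below is only an illustration under stated assumptions: three binary sites, an arbitrary randomly drawn joint distribution, and a hypothetical data-dependent scan (`scan_order`) that always starts at site 0 and then chooses the next site according to the value observed there.

```python
import itertools
import math
import random

def entropy(pmf):
    """Shannon entropy (in bits) of a pmf given as a dict value -> probability."""
    return -sum(p * math.log2(p) for p in pmf.values() if p > 0)

def scan_order(y):
    """Hypothetical data-dependent scan on 3 sites: the second site depends on y[0]."""
    return (0, 1, 2) if y[0] == 0 else (0, 2, 1)

# An arbitrary joint distribution of the noisy array Y_B over {0,1}^3.
rng = random.Random(1)
weights = {y: rng.random() for y in itertools.product((0, 1), repeat=3)}
total = sum(weights.values())
p_y = {y: w / total for y, w in weights.items()}

# Distribution of the reordered sequence Psi(Y_B).
p_reordered = {}
for y, p in p_y.items():
    seq = tuple(y[site] for site in scan_order(y))
    p_reordered[seq] = p_reordered.get(seq, 0.0) + p

print(entropy(p_y), entropy(p_reordered))
```

The two entropies coincide because, given the values already observed, the next site is fixed, so the map from $Y_B$ to $\Psi(Y_B)$ is one-to-one.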

At this point, a few remarks are in order. Theorem 2 is the direct analogue of the lower
bounds in [11] for the filtering scenario. Note, however, that it holds for any finite set of
sites $B$. Furthermore, it applies to arbitrarily distributed random fields (even non-stationary
fields), and to a wide family of loss functions. In fact, the only condition on $l$ is that the
associated Bayes envelope $\mu(P)$ is well defined. Note also that the lower bound on $U(l, Q)$
given in Theorem 2 results from the application of a single letter function, $\bar{\zeta}^{-1}(\cdot)$, to the
normalized entropy of the noisy field, $\frac{1}{|B|} H(Y_B)$. That is, the memory in $(X_B, Y_B)$ is reflected
only in $\frac{1}{|B|} H(Y_B)$.

The proof of Theorem 2 is general and direct, however, it lacks the insightful geometrical
interpretation which led to the lower bound in [11]. Therein, Merhav and Weissman showed
that the transformation from a data array to an error sequence (defined by a specific scandictor
$(\Psi, F)$) is volume preserving. Thus, the least expected cumulative error is the radius of a
sphere, whose volume is the volume of the set of all typical data arrays of the source. This
happens when all the typical data arrays of the source map to a sphere in the "error vectors"
space, and thus Merhav and Weissman were able to identify cases where the lower bound
is tight. Currently, we cannot point out specific cases in which (15) is tight. Moreover, as
the next two examples show, in the scanning and filtering scenario (unlike the scanning and
prediction scenario we discuss in Section 4), $\zeta(d)$ may not be concave, and thus $\bar{\zeta}(d) \ne \zeta(d)$.
Note, in this context, that there is no natural time sharing solution in this case, as there is no
natural trade-off between two (or more) optimal points, and there is only one criterion to be
minimized - the cumulative scanning and filtering loss (as opposed to rate versus distortion,
for example).

3.1.1 Binary Input and BSC 

To illustrate its use, we specialize Theorem 2 to the case of binary input through a BSC, i.e.,
the input random field $X_{V_n}$ is binary, and $Y_{V_n}$ is the output of a BSC whose input is $X_{V_n}$
and whose crossover probability is $\delta < 1/2$. Note, however, that although the derivations below are
specific to the binary alphabet and Hamming loss, they are easily extendible to an arbitrary finite
alphabet and discrete memoryless channel with a channel transition matrix $\Pi$ and loss function
$\Lambda(\cdot, \cdot)$.

To compute the lower bound on the best achievable scanning and filtering performance, we
evaluate $\mu(P)$ and $\zeta(d)$. By the definitions in (12) and (13), we consider the scalar problem
of estimation of a random variable $X$ based on its noisy observation $Y$. Denote by $p_y$ the
probability $P(Y = 1)$ and by $p_x$ the probability $P(X = 1)$. The best achievable performance,
$\mu(p_y)$, which clearly depends on $\delta$, and, hence, is denoted $\mu_\delta(p_y)$, is given by

$$
\begin{aligned}
\mu_\delta(p_y) &= \sum_y \sum_x P(x, y)\, l_H(x, g_{opt}(y)) \\
&\overset{(a)}{=} \sum_y P(y) \min_x P(x|y) \\
&= \sum_y \min_x P(x, y) \\
&= \min\{p_x(1 - \delta), \delta(1 - p_x)\} + \min\{p_x \delta, (1 - \delta)(1 - p_x)\} \\
&= \min\{p_x, 1 - p_x, \delta\} \\
&\overset{(b)}{=} \min\left\{\frac{p_y - \delta}{1 - 2\delta}, \frac{1 - p_y - \delta}{1 - 2\delta}, \delta\right\},
\end{aligned} \qquad (17)
$$

where (a) results from the optimality of $g_{opt}(y)$ and (b) results from the invertibility of the
channel. Consequently,

$$\zeta(d) = \max_p h_b(p) \ \text{ s.t. } \ \mu_\delta(p) \le d \ = \begin{cases} h_b(\delta * d) & d < \delta \\ 1 & d \ge \delta, \end{cases} \qquad (18)$$

where $h_b(\cdot)$ is the binary entropy function and $\delta * d = d(1 - \delta) + \delta(1 - d)$. Note that since
$\delta * \delta < 1/2$ for $0 < \delta < 1/2$, there is a discontinuity at $d = \delta$, hence $\zeta(d)$ is generally not
concave and $\bar{\zeta}(d) \ne \zeta(d)$ (although $\bar{\zeta}(d)$ can be easily calculated). Figure 1 includes plots of
both $\zeta(d)$ and $\bar{\zeta}(d)$ for $\delta = 0.25$. We also mention that $d = \delta$ is a realistic cumulative loss in
non-trivial situations, as there are cases where "say-what-you-see" (and thus suffer a loss $\delta$)
is the best any filter can do [21]. Furthermore, note that $\zeta(d)$ is not the maximum entropy
function used in [11] to derive the lower bound on the scandictability.
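
The closed form in (18) and its discontinuity at $d = \delta$ can be checked with a short numerical sketch. It assumes entropies measured in bits (so that the maximum is 1) and compares the closed form with a brute-force search over the output probabilities $p_y$ reachable through the BSC; the function names `hb`, `conv`, `mu`, `zeta_closed` and `zeta_grid` are illustrative only.

```python
import math

def hb(p):
    """Binary entropy function in bits."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def conv(a, b):
    """Binary convolution a * b = a(1 - b) + b(1 - a)."""
    return a * (1 - b) + b * (1 - a)

def mu(p_y, delta):
    """Bayes envelope (17), written in terms of the output probability p_y."""
    p_x = (p_y - delta) / (1 - 2 * delta)
    return min(p_x, 1 - p_x, delta)

def zeta_closed(d, delta):
    """Closed form (18)."""
    return hb(conv(delta, d)) if d < delta else 1.0

def zeta_grid(d, delta, steps=20001):
    """Brute-force maximum of hb(p_y) over reachable p_y with mu(p_y) <= d."""
    best = 0.0
    for k in range(steps):
        p_y = delta + (1 - 2 * delta) * k / (steps - 1)   # p_y lies in [delta, 1 - delta]
        if mu(p_y, delta) <= d + 1e-12:
            best = max(best, hb(p_y))
    return best

delta = 0.25
print(zeta_grid(0.1, delta), zeta_closed(0.1, delta))
print(zeta_closed(delta - 1e-9, delta), zeta_closed(delta, delta))   # the jump at d = delta
```

Just below $d = \delta$ the function approaches $h_b(\delta * \delta) < 1$, while at $d = \delta$ every input distribution becomes feasible and the value jumps to 1.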

Finally, exact evaluation of the bound in Theorem 2 may be difficult in many cases, as the
entropy $\frac{1}{|B|} H(Y_B)$ may be hard to calculate, and only bounds on its value can be used.¹ At
the end of Section 3.2.2 we give a numerical example for the bound in Theorem 2 using a
lower bound on the entropy rate.

Remark 1. Clearly, $\zeta(d)$ is interesting only in the region $d \le \delta$, as any reasonable filter will have
an expected normalized cumulative loss smaller than or equal to the channel crossover probability.
However, due to the discontinuity at $d = \delta$, $\zeta(d)$ is concave for $d < \delta$ but not for $d \le \delta$. This
is fortunate, as if $\zeta(d)$ were concave on $d \le \delta$, Theorem 2 would have resulted in $h_b(\delta * \delta)$ as an
upper bound on the entropy rate of any binary source corrupted by a BSC, which is erroneous
(for example, it violates $h_b(\pi * \delta)$ as a lower bound on the entropy rate of a first order Markov
source with transition probability $\pi$ corrupted by a BSC with crossover probability $\delta$).

¹ Think, for example, of an input process which is a first order Markov source. While the entropy rate of the input
is known, the output is a hidden Markov process whose entropy rate is unknown in general.

Figure 1: The function $\zeta(d)$, as it appears in (18), and its upper concave envelope, $\bar{\zeta}(d)$, both
plotted for $\delta = 0.25$. Note that $\zeta(d)$ and $\bar{\zeta}(d)$ have analytic expressions, and the plots are discrete
only to better distinguish between them.



3.1.2 Gaussian Channel 

Consider now the case where $Y_{V_n}$ is the output of an AWGN channel, whose input is arbitrarily
distributed. Assume the squared error loss. As the optimal filter is clearly the conditional
expectation, $\zeta(d)$ in this case is given by

$$\zeta(d) = \max\{H(X + N) : \mathrm{Var}(X | X + N) \le d\}, \quad N \sim \mathcal{N}(0, \sigma_N^2), \quad N \perp X. \qquad (19)$$

Since $H(X + N | X) = H(N)$ is fixed, this is similar to the classical Gaussian channel capacity
problem, only now the input constraint is $\mathrm{Var}(X | X + N) \le d$, which generally depends on the
distribution of $X$ rather than solely on its variance, and hence is not necessarily achieved by
Gaussian $X$.

When the input is also limited to be Gaussian, however, the optimization problem in
(13) is trivial and $\zeta(d)$ can be easily calculated (note that in this case $\zeta(d)$ is valid only to
bound the performance for scanning and filtering of Gaussian fields corrupted by AWGN).
Since the distributions depend only on the variance (assuming zero expectation), we have
$\mu(P) = \mu(\sigma_Y^2)$, and, in fact,

$$\mu(\sigma_Y^2) = \sigma_N^2 - \frac{\sigma_N^4}{\sigma_Y^2}. \qquad (20)$$

Hence, solving the constraint for $\sigma_Y^2$,

$$\zeta(d) = \max_{\sigma_Y^2} \frac{1}{2} \ln(2 \pi e \sigma_Y^2) \ \text{ s.t. } \ \mu(\sigma_Y^2) \le d \ = \frac{1}{2} \ln\left(\frac{2 \pi e\, \sigma_N^4}{\sigma_N^2 - d}\right). \qquad (21)$$

Unlike the binary setting, here the cumulative loss, $d$, will be strictly smaller than $\sigma_N^2$ for
any non-trivial setting and reasonable filter, as the error in symbol-by-symbol filtering is
$\sigma_N^2 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2} < \sigma_N^2$. $\zeta(d)$ is convex ($\cup$) for $d < \sigma_N^2$, and the chain of inequalities in (16)
cannot be tight.



3.2 Bounds on the Excess Loss of Non-Optimal Scanners 

Theorem 2 gives a lower bound on the optimum scanning and filtering performance. However,
it is interesting to investigate what is the excess scanning and filtering loss when non-optimal 
scanners are used. Specifically, in this section we address the following question: Suppose 
that, for practical reasons for example, one uses a non-optimal scanner, accompanied with the 
optimal filter for that scan. How large is the excess loss incurred by this scheme with respect 
to optimal scanning and filtering? 

We consider both the case of a Gaussian channel and squared error loss (with Gaussian 
or arbitrarily distributed input) and the case of a binary source passed through a BSC and 
Hamming loss. While the tools we use in order to construct such a bound for the binary 
case are similar to the ones used in [12], we develop a new set of tools and techniques for the
Gaussian setting. 

3.2.1 Gaussian Channel 

We investigate the excess scanning and filtering loss when non-optimal scanners are used,
for the case of arbitrarily distributed input corrupted by a Gaussian channel. We first focus
attention on the case where the input is Gaussian as well, and then derive new results for
the more general setting.

As in [12], the bound is achieved by bounding the absolute difference between
the scanning and filtering performance of any two scans, $\Psi_1$ and $\Psi_2$, assuming both use their
optimal filters. This bound, however, results from a relation between the performance of
discrete time filtering and continuous time filtering, together with the fundamental result of
Duncan [17] on the relation between mutual information and causal minimal mean square
error estimation in a Gaussian channel. Namely, we use the mutual information as a
scan-invariant feature, and the actual value of the excess loss bound results from
the difference between the discrete and continuous time filtering problems, as will be made precise
below.

From now on we assume the loss function is the squared error loss, $l_s(\cdot)$. We start with
several definitions. Let $X$ be a Gaussian random variable, $X \sim \mathcal{N}(0, \sigma_X^2)$. Consider the
following two estimation problems:

• The scalar problem of estimating $X$ based on $Y = X + N$, where $N \sim \mathcal{N}(0, \sigma_N^2)$,
independent of $X$.

• The continuous time problem of causally estimating $X_t = X$, $t \in [0, 1]$, based on $Y_t$,
which is an AWGN-corrupted version of $X_t$, the Gaussian noise having a spectral density
level of $\sigma_N^2$.

To bound the sensitivity of the scanning and filtering performance, it is beneficial to consider
the difference between the estimation errors in the above two problems, that is,

$$\int_0^1 \mathrm{Var}(X_t | Y^t)\, dt - \mathrm{Var}(X | Y), \qquad (22)$$

where $Y^t$ is the continuous time signal $\{Y_{t'}\}_{t'=0}^{t}$. Clearly, $\mathrm{Var}(X | Y) = \frac{\sigma_X^2 \sigma_N^2}{\sigma_X^2 + \sigma_N^2}$. Since $\int_0^t Y_{t'}\, dt'$
is a sufficient statistic in the estimation of $X_t = X$, $\mathrm{Var}(X_t | Y^t)$ is equivalent to the squared
error in estimating $X$ based on $X + N$, $N$ being a Gaussian random variable, independent of
$X$, with zero mean and variance $\sigma_N^2 / t$. Thus,

$$\int_0^1 \mathrm{Var}(X_t | Y^t)\, dt - \mathrm{Var}(X | Y) = \int_0^1 \frac{\sigma_X^2 \sigma_N^2 / t}{\sigma_X^2 + \sigma_N^2 / t}\, dt - \frac{\sigma_X^2 \sigma_N^2}{\sigma_X^2 + \sigma_N^2} = \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right), \qquad (23)$$

where

$$f(x) = \ln(1 + x) - \frac{x}{x + 1}. \qquad (24)$$
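
The identity in (23) is easy to confirm numerically. The following sketch integrates the causal error with a midpoint rule and compares the gap to $\sigma_N^2 f(\sigma_X^2 / \sigma_N^2)$; the variances `var_x = 2.0` and `var_n = 0.5` are arbitrary illustrative values, and the helper names are not from the paper.

```python
import math

def f(x):
    """f(x) = ln(1 + x) - x / (x + 1), as in (24)."""
    return math.log1p(x) - x / (x + 1.0)

def scalar_mmse(var_x, var_n):
    """MMSE of a Gaussian X observed through additive Gaussian noise."""
    return var_x * var_n / (var_x + var_n)

def causal_mmse_integral(var_x, var_n, steps=100000):
    """Midpoint-rule integral over t in [0, 1] of the MMSE with noise variance var_n / t."""
    h = 1.0 / steps
    total = 0.0
    for k in range(steps):
        t = (k + 0.5) * h
        total += scalar_mmse(var_x, var_n / t) * h
    return total

var_x, var_n = 2.0, 0.5          # illustrative values only
gap = causal_mmse_integral(var_x, var_n) - scalar_mmse(var_x, var_n)
print(gap, var_n * f(var_x / var_n))
```

The gap is non-negative, reflecting that the causal continuous time estimator has, on average over $[0, 1]$, seen less of the observation than the scalar estimator.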



The following is the main result in this sub-section. 

Theorem 3. Let $X_{V_n}$ be a Gaussian random field with a constant marginal distribution satisfying
$\mathrm{Var}(X_i) = \sigma_X^2 < \infty$ for all $i \in V_n$. Let $Y_i = X_i + N_i$, where $N_{V_n}$ is a white Gaussian
noise of variance $\sigma_N^2$, independent of $X_{V_n}$. Then, for any two scans $\Psi_1$ and $\Psi_2$ we have

$$\frac{1}{n^2} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right). \qquad (25)$$

Theorem 3 bounds the absolute difference between the scanning and filtering performance
of any two scanners, $\Psi_1$ and $\Psi_2$, assuming they use their optimal filters. Clearly, since the
scanners are arbitrary, this result can also be interpreted as the difference in performance
between any scan $\Psi$, and the best achievable performance, $U(l, Q_{V_n})$. Note that the bound
value, $\sigma_N^2 f(\sigma_X^2 / \sigma_N^2)$, is a single letter expression, which depends on the input field $X_{V_n}$ and the
noise $N_{V_n}$ only through their variances. Namely, the bound does not depend on the memory
in $X_{V_n}$.
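
As a rough numerical illustration of Theorem 3, the sketch below evaluates the optimal filtering loss of many data-independent scans on a small Gaussian field and compares the spread of the losses with the bound in (25). The 3 x 3 grid, the separable covariance $\sigma_X^2 \rho^{|\Delta r| + |\Delta c|}$, the value $\rho = 0.9$ and the helper names are assumptions made for the example; data-dependent scans are not covered by this sketch.

```python
import math
import random

def f(x):
    """f(x) = ln(1 + x) - x / (x + 1), as in (24)."""
    return math.log1p(x) - x / (x + 1.0)

def field_covariance(n, var_x, rho):
    """Illustrative covariance var_x * rho^(|dr| + |dc|) on an n x n grid."""
    sites = [(r, c) for r in range(n) for c in range(n)]
    return [[var_x * rho ** (abs(r1 - r2) + abs(c1 - c2)) for (r2, c2) in sites]
            for (r1, c1) in sites]

def filtering_loss(cov, scan, var_n):
    """Normalized MMSE of the optimal filter along a data-independent scan."""
    c = [row[:] for row in cov]
    m = len(c)
    total = 0.0
    for j in scan:
        denom = c[j][j] + var_n
        # Posterior covariance of the clean field after observing Y_j = X_j + N_j.
        c = [[c[a][b] - c[a][j] * c[j][b] / denom for b in range(m)]
             for a in range(m)]
        total += c[j][j]          # Var(X_j | Y at all sites visited so far)
    return total / m

n, var_x, var_n, rho = 3, 1.0, 1.0, 0.9
cov = field_covariance(n, var_x, rho)
rng = random.Random(0)
scans = [list(range(n * n))]      # raster scan
for _ in range(200):              # plus random reorderings of the sites
    perm = list(range(n * n))
    rng.shuffle(perm)
    scans.append(perm)
losses = [filtering_loss(cov, s, var_n) for s in scans]
spread = max(losses) - min(losses)
bound = var_n * f(var_x / var_n)
print(spread, bound)
```

Because the field is Gaussian, the conditional variances do not depend on the observed values, so the loss of each scan can be computed from the covariance alone by sequential conditioning.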

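Proof (Theorem 3). As mentioned earlier, the comparison between any two scans is made by
bounding the normalized cumulative loss of any scan $\Psi$ in terms of a scan invariant entity,
which is the mutual information.

For simplicity, assume first that the scan $\Psi$ is data-independent, namely, it is merely a
reordering of the entries of $Y_{V_n}$. In this case, $\{X_{\Psi_i}\}_{i=1}^{n^2}$ is a discrete time Gaussian vector.
We construct from it a continuous time process, $\{X_t^{(c)}\}_{t \in [0, n^2]}$, where for any $t \in [i-1, i)$,
$X_t^{(c)} = X_{\Psi_i}$, $i \in \{1, 2, \ldots, n^2\}$. That is, $X_t^{(c)}$ is a piecewise constant process, whose constant
values at intervals of length 1 correspond to the original values of the discrete time vector
$\{X_{\Psi_i}\}$. Let $\{Y_{\Psi_i}\}$ and $\{Y_t^{(c)}\}$ be the AWGN-corrupted versions of $\{X_{\Psi_i}\}$ and $X_t^{(c)}$, namely,
$Y_{\Psi_i} = X_{\Psi_i} + N_{\Psi_i}$ and $Y_t^{(c)}$ is constructed according to

$$dY_t^{(c)} = X_t^{(c)}\, dt + \sigma_N\, dW_t, \quad t \in [0, n^2], \qquad (26)$$

where $W_t$ is a standard Brownian motion. Observe that the white Gaussian noise, $\sigma_N\, dW_t$, has
a spectral density of level $\sigma_N^2$, similar to the variance of the discrete time noise $N_{V_n}$. Since we
switch from discrete time to continuous time, it is important to note that the noise value in
the two problems is equivalent. That is, if the discrete time field $X_{V_n}$ is corrupted by noise of
variance $\sigma_N^2$, then we wish the continuous time white noise to have a spectrum such that the
integral over an interval of length 1, whose integrand is the continuous output $Y_t^{(c)}$ (and thus
is a sufficient statistic in order to estimate the piecewise-constant input $X_t^{(c)}$ in this interval),
will be a random variable which is exactly $X_{\Psi_i} + N_{\Psi_i}$, with $N_{\Psi_i}$ having a variance of $\sigma_N^2$.
We have,

$$
\begin{aligned}
\frac{1}{n^2} E L_{(\Psi, F^{opt})}(X_{V_n}, Y_{V_n}) &= \frac{1}{n^2} \sum_{i=1}^{n^2} \mathrm{Var}(X_{\Psi_i} | Y_{\Psi_1}^{\Psi_i}) \\
&\overset{(a)}{=} \frac{1}{n^2} \sum_{i=1}^{n^2} \left[ \int_0^1 \mathrm{Var}\left(X_{\Psi_i} \,\middle|\, Y_{\Psi_1}^{\Psi_{i-1}}, \{Y_{t'}^{(c)}\}_{t' \in [i-1, i-1+t]}\right) dt - \sigma_N^2 f\left(\frac{\mathrm{Var}(X_{\Psi_i} | Y_{\Psi_1}^{\Psi_{i-1}})}{\sigma_N^2}\right) \right] \\
&\overset{(b)}{\ge} \frac{1}{n^2} \sum_{i=1}^{n^2} \int_0^1 \mathrm{Var}\left(X_{\Psi_i} \,\middle|\, Y_{\Psi_1}^{\Psi_{i-1}}, \{Y_{t'}^{(c)}\}_{t' \in [i-1, i-1+t]}\right) dt - \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right) \\
&\overset{(c)}{=} \frac{1}{n^2} 2 \sigma_N^2 I\left(\{X_t^{(c)}\}_{t \in [0, n^2]}; \{Y_t^{(c)}\}_{t \in [0, n^2]}\right) - \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right) \\
&= \frac{1}{n^2} 2 \sigma_N^2 I\left(\{X_{\Psi_i}\}; \{Y_{\Psi_i}\}\right) - \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right) \\
&\overset{(d)}{=} \frac{1}{n^2} 2 \sigma_N^2 I(X_{V_n}; Y_{V_n}) - \sigma_N^2 f\left(\frac{\sigma_X^2}{\sigma_N^2}\right).
\end{aligned} \qquad (27)
$$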

The equality (a) is from the application of (23) with $X = X_{\Psi_i} | Y_{\Psi_1}^{\Psi_{i-1}}$, i.e., with $X_{\Psi_i}$
distributed conditioned on $Y_{\Psi_1}^{\Psi_{i-1}}$. Note that $X_{\Psi_i}$, conditioned on $Y_{\Psi_1}^{\Psi_{i-1}}$, is indeed Gaussian, and
that (23) applies to any Gaussian $X$ corrupted by Gaussian noise. The inequality (b) is since
$\mathrm{Var}(X_{\Psi_i} | Y_{\Psi_1}^{\Psi_{i-1}}) \le \mathrm{Var}(X_i)$ and due to the increasing monotonicity of $f$, (c) is since the resulting
integral from 0 to $n^2$ is simply the minimal mean square error in filtering $\{Y_t^{(c)}\}$ (as $Y_{\Psi_i}$ is a
sufficient statistic with respect to $\{Y_{t'}^{(c)}\}_{t' \in [i-1, i]}$), and the application of Duncan's result
[17, Theorem 3]. Finally, (d) is since the mutual information is invariant to the reordering of
the random variables. To complete the proof of Theorem 3, simply note that since $f(x)$ is
non-negative for $x > 0$, by (a) above, the normalized cumulative loss can be upper bounded
as well, that is,

$$\frac{1}{n^2} E L_{(\Psi, F^{opt})}(X_{V_n}, Y_{V_n}) \le \frac{1}{n^2} \sum_{i=1}^{n^2} \int_0^1 \mathrm{Var}\left(X_{\Psi_i} \,\middle|\, Y_{\Psi_1}^{\Psi_{i-1}}, \{Y_{t'}^{(c)}\}_{t' \in [i-1, i-1+t]}\right) dt, \qquad (28)$$

hence, similarly as in the chain of inequalities leading to (27),

$$\frac{1}{n^2} E L_{(\Psi, F^{opt})}(X_{V_n}, Y_{V_n}) \le \frac{1}{n^2} 2 \sigma_N^2 I(X_{V_n}; Y_{V_n}). \qquad (29)$$
In fact, equation (29) can be viewed as the scanning and filtering analogue of [18, eq. (156a)].

Now, if the scan $\Psi$ is data-dependent, the above derivations apply, with the use of the
smoothing property of conditional expectation. That is, conditioned on $Y_{\Psi_1}^{\Psi_{i-1}}$, the position $\Psi_i$
is fixed (assuming deterministic scanners, though random scanning order can be tackled with
a similar method), relation (a) in (27) holds since it holds conditioned on $Y_{\Psi_1}^{\Psi_{i-1}}$, and relation
(d) holds as the mutual information is invariant under data-dependent reordering as well. This
is very similar to the methods used in the proof of [12, Proposition 13], where it was shown
that the entropy of a vector is invariant to data-dependent reordering. □

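At this point, a few remarks are in order. A very simple bound, applicable to arbitrarily
distributed fields and under squared error loss (yet interesting mainly in the Gaussian regime)
results from noting that for any random variables $X$ and $Y = X + N$,

$$0 \le \mathrm{Var}(X | Y) \le \sigma_N^2 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2}. \qquad (30)$$

Namely, simple symbol by symbol restoration results in a cumulative loss of at most $\sigma_N^2 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2}$,
and we have,

$$
\begin{aligned}
\frac{1}{n^2} E L_{(\Psi, F^{opt})}(X_{V_n}, Y_{V_n}) &= \frac{1}{n^2} \sum_i \mathrm{Var}(X_{\Psi_i} | Y_{\Psi_1}^{\Psi_i}) \\
&\le \frac{1}{n^2} \sum_i \mathrm{Var}(X_{\Psi_i} | Y_{\Psi_i}) \\
&\le \sigma_N^2 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2}.
\end{aligned} \qquad (31)
$$

Thus, the excess loss in non-optimal scanning cannot be greater than that value, hence,

$$\frac{1}{n^2} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le \sigma_N^2 \frac{\sigma_X^2}{\sigma_X^2 + \sigma_N^2}. \qquad (32)$$

In the next sub-section, we derive a tighter bound than the bound in (32), applicable to
arbitrarily distributed noise-free fields. However, since this bound may be harder to evaluate,
it is interesting to discuss the properties of (32) as well.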

Both the bound in Theorem[3]and the bound in (|32l) are in the form of Var(Xi)(7(SNR), for 
some g, where SNR = cr^/o"^. This means that any bound obtained for a certain SNR applies 
to all values of Var(A'i) by rescaling. The bound in Theorem [3] has the form Var(Xi)^^^^^^, 
where /(•) was defined in ([2^ . and we have 

SNR^O+ SNR SNR^oo SNR ' ^ ^ 
[Figure 2 plot: "Bounds on the scantering sensitivity - Gaussian input". Horizontal axis: SNR (Var(X)/Var(N)). Legend: symbol-by-symbol bound (arbitrary distributions); Gaussian input and Gaussian noise.]

Figure 2: Bounds on the excess loss in scanning and filtering of Gaussian input corrupted by AWGN.
The solid line is the bound given in Theorem 3. The dashed line is the bound given in (32).



that is, the scan is inconsequential at very high or very low SNR. This is clear, as at high SNR
the current observation is by far the most influential, and which previous observations are used
is inconsequential. For low SNR, the cumulative loss is high whatever the scan is. Unlike the
bound in Theorem 3, (32) does not predict the correct behavior for $\mathrm{SNR} \to 0^+$, and is mainly
interesting in the high SNR regime.

The above observations are also evident in Figure 2, which includes both the bound given
in Theorem 3, applicable to Gaussian fields, and (32), applicable to arbitrarily distributed
fields. It is also evident that in the case of Gaussian fields, $f(\mathrm{SNR})/\mathrm{SNR}$ has a unique maximum of
approximately 0.216, that is, the excess loss due to a suboptimal scan at any SNR is upper
bounded by $0.216\,\mathrm{Var}(X_i)$.
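The maximum quoted above can be checked numerically. The following is a minimal sketch, assuming the Gaussian-case normalized bound takes the closed form ln(1 + SNR)/SNR - 1/(1 + SNR), which is the form implied by the restatement of Theorem 3 in terms of I(SNR) and mmse(SNR) later in this section. The function names and the grid limits are illustrative choices, not part of the paper.

```python
import math

def gaussian_scan_bound(snr):
    """Normalized Theorem 3 bound f(SNR)/SNR, assumed here to equal
    ln(1 + SNR)/SNR - 1/(1 + SNR)."""
    return math.log1p(snr) / snr - 1.0 / (1.0 + snr)

def symbol_by_symbol_bound(snr):
    """Bound (32) divided by Var(X): 1/(1 + SNR)."""
    return 1.0 / (1.0 + snr)

def maximize_on_grid(func, lo=1e-3, hi=50.0, steps=200000):
    """Plain grid search; returns (argmax, max) of func on [lo, hi]."""
    best_x, best_val = lo, func(lo)
    for k in range(1, steps + 1):
        x = lo + (hi - lo) * k / steps
        val = func(x)
        if val > best_val:
            best_x, best_val = x, val
    return best_x, best_val

snr_star, peak = maximize_on_grid(gaussian_scan_bound)
```

Under this assumption the grid search places the peak near SNR = 2.2, and the normalized bound vanishes at both SNR extremes, whereas `symbol_by_symbol_bound` tends to 1 as SNR tends to 0.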

Remark 2. It is clear from the proof of Theorem 3 that an upper bound on the expression in
(22), valid for arbitrarily distributed input X, may yield an upper bound on the excess scanning
and filtering loss which is also valid for arbitrarily distributed random fields. However, while
the integral in (22) can be upper bounded by assuming a Gaussian X, Var(X|Y) has no
non-trivial lower bound. In fact, in [22], it is shown that if X is the following binary random
variable,

\[
X = \begin{cases} \sqrt{\frac{1-p}{p}} & \text{w.p. } p, \\ -\sqrt{\frac{p}{1-p}} & \text{w.p. } 1-p, \end{cases} \tag{34}
\]

for which $EX = 0$ and $EX^2 = 1$, then we have



1 _zx!3l 



Var{X\Y) <— -e mi-p) , (35) 

2p(l - p) 

which can be arbitrarily close to 0 for small enough p. Thus, the only lower bound on Var(X|Y)
which is valid for any X with $EX^2 < \infty$, and depends only on $\sigma_X^2$ and $\sigma_N^2$, is 0 (and hence
results in a bound weaker than Theorem 3 or (32)).

In the next two subsections, we derive new bounds on the excess loss, which are valid for
more general input fields. First, we generalize the bound in Theorem 3. While the result may
be complex to evaluate in its general form, we show that for binary input fields the bound
admits a simple form. We then show that if the input alphabet is continuous, then a non-trivial
bound on Var(X|Y) can be derived easily, which, in turn, results in a new bound on the excess
loss.

A Generalization of Theorem 3. A generalization of Theorem 3 results from revisiting
equality (a) of (27), which is simply the application of (23) with $X = X_{\Psi_i}|Y_{\Psi_1}^{\Psi_{i-1}}$. While it
is clear that an expression similar to that in (23) can be computed for non-Gaussian X, it is
not clear that $X_{\Psi_i}|Y_{\Psi_1}^{\Psi_{i-1}}$ has the same distribution for any $1 \le i \le |V_n|$ (unlike the Gaussian
setting, where $X_{\Psi_i}|Y_{\Psi_1}^{\Psi_{i-1}}$ is always Gaussian). Nevertheless, using the definition below, one
can generalize Theorem 3 for arbitrarily distributed inputs as follows.

For any $(X_{V_n}, Y_{V_n})$, where $Y_{V_n}$ is the AWGN-corrupted version of $X_{V_n}$, define

/*(Ay„ , al) = max I /' Var (x^^ |y*;-^ , {Y^^^, dt - Var (x^Jlf ;) | . 

(36) 

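Theorem 4. Let $X_{V_n}$ be an arbitrarily distributed random field, with a constant marginal
distribution satisfying $\mathrm{Var}(X_i) = \sigma_X^2 < \infty$ for all $i \in V_n$. Let $Y_i = X_i + N_i$, where $N_{V_n}$ is a
white Gaussian noise of variance $\sigma_N^2$, independent of $X_{V_n}$. Then, for any two scans $\Psi_1$ and
$\Psi_2$, we have

\[
\frac{1}{|V_n|} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le f^*(X_{V_n}, \sigma_N^2). \tag{37}
\]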

The proof of Theorem 4 is similar to that of Theorem 3 and appears in Appendix A.1.
Note that $f^*(X_{V_n}, \sigma_N^2)$ is scan-independent, as it includes a maximization over all possible
scans. At first sight, it seems like this maximization may take the sting out of the excess loss
bound. However, as the example below shows, at least for the interesting scenario of binary
input, this is not the case.
First, however, a few more general remarks are in order. Since important insight can be
gained when using the results of Guo, Shamai and Verdú [18], let us mention the setting used
therein. In [18], one wishes to estimate X based on $\sqrt{\mathrm{SNR}}X + N$, where N is a standard
normal random variable. Denote by I(SNR) and mmse(SNR) the mutual information between
X and $\sqrt{\mathrm{SNR}}X + N$, and the minimal mean square error in estimating X based on
$\sqrt{\mathrm{SNR}}X + N$, respectively. Note that Var(X|Y) in our setting equals $\sigma_X^2\,\mathrm{mmse}(\sigma_X^2/\sigma_N^2)$. Under these
definitions,

\[
\frac{d}{d\mathrm{SNR}} I(\mathrm{SNR}) = \frac{1}{2}\mathrm{mmse}(\mathrm{SNR}), \tag{38}
\]

or, equivalently,

\[
I(\mathrm{SNR}) = \frac{1}{2} \int_0^{\mathrm{SNR}} \mathrm{mmse}(\gamma)\, d\gamma. \tag{39}
\]

Consequently, the result of Theorem 3 can be restated as

\[
\frac{1}{|V_n|} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le 2\sigma_N^2 I(\mathrm{SNR}) - \sigma_X^2\,\mathrm{mmse}(\mathrm{SNR}), \tag{40}
\]

where $I(\mathrm{SNR}) = \frac{1}{2}\ln(1 + \mathrm{SNR})$ and $\mathrm{mmse}(\mathrm{SNR}) = \frac{1}{1+\mathrm{SNR}}$ are simply the mutual information
and minimal mean square error of the scalar problem (hence, a single letter expression) of
estimating a Gaussian X based on $\sqrt{\mathrm{SNR}}X + N$, where N is standard Gaussian. In fact, the
bound in Theorem 4 will always have the form $2\sigma_N^2 I(\mathrm{SNR}) - \sigma_X^2\,\mathrm{mmse}(\mathrm{SNR})$, for some $X^*$
whose distribution is the maximizing distribution in (36). The next example shows that this
is indeed the case for binary input as well, and the resulting bound can be easily computed.
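As a sanity check on the restatement (40), the integral relation (39) can be verified numerically in the Gaussian case. This is a minimal sketch with illustrative function names; it only uses the closed forms I(SNR) = (1/2) ln(1 + SNR) and mmse(SNR) = 1/(1 + SNR) stated above.

```python
import math

def mmse_gaussian(snr):
    """Scalar Gaussian-input MMSE: 1/(1 + SNR)."""
    return 1.0 / (1.0 + snr)

def mutual_info_gaussian(snr):
    """Scalar Gaussian-input mutual information in nats: (1/2) ln(1 + SNR)."""
    return 0.5 * math.log(1.0 + snr)

def half_integral_of_mmse(snr, steps=20000):
    """Midpoint-rule approximation of (1/2) * integral_0^snr mmse(g) dg, as in (39)."""
    h = snr / steps
    total = sum(mmse_gaussian((k + 0.5) * h) for k in range(steps))
    return 0.5 * h * total

def theorem3_bound(var_x, var_n):
    """Right-hand side of (40): 2 sigma_N^2 I(SNR) - sigma_X^2 mmse(SNR)."""
    snr = var_x / var_n
    return 2.0 * var_n * mutual_info_gaussian(snr) - var_x * mmse_gaussian(snr)
```

For instance, `theorem3_bound(2.0, 1.0)` evaluates ln 3 - 2/3, that is, Var(X) times the normalized bound at SNR = 2.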

Example 1 (Binary input and AWGN). Consider the case where $X_{V_n}$ is a binary random field,
with a symmetric marginal distribution (that is, $P(X_0 = \sigma_X) = P(X_0 = -\sigma_X) = 1/2$). Note
that the $X_i$'s are not necessarily i.i.d., and any dependence between them is possible. $Y_{V_n}$
is the AWGN-corrupted version of $X_{V_n}$. To evaluate the bound in Theorem 4, $f^*(X_{V_n}, \sigma_N^2)$
should be calculated. However, for any scan $\Psi$ and time i, $X_{\Psi_i}|Y_{\Psi_1}^{\Psi_{i-1}}$ is still a binary random
variable, taking the values $\pm\sigma_X$ with probabilities $(p, 1-p)$, for some $0 \le p \le 1/2$. Hence,

\[
f^*(X_{V_n}, \sigma_N^2) \le \max_{0 \le p \le 1/2} \left\{ \int_0^1 \mathrm{Var}(X_t|Y^t)\,dt - \mathrm{Var}(X|Y) \right\}, \tag{41}
\]

where X is a binary random variable, taking the values $\pm\sigma_X$ with probabilities $(p, 1-p)$,
$X_t = X$, $Y = X + N$ and $Y_t$ is the AWGN-corrupted version of $X_t$. The following result holds
for any random variable X.

Claim 5. For any random variable X with $\mathrm{Var}(X) = \sigma_X^2 < \infty$, the expression in (22) is
monotonically increasing in $\sigma_X^2$.
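Proof. We have,

\[
\int_0^1 \mathrm{Var}(X_t|Y^t)\,dt - \mathrm{Var}(X|Y)
= \int_0^1 \sigma_X^2\,\mathrm{mmse}(\sigma_X^2 t/\sigma_N^2)\,dt - \sigma_X^2\,\mathrm{mmse}(\sigma_X^2/\sigma_N^2)
= \sigma_N^2 \int_0^{\sigma_X^2/\sigma_N^2} \mathrm{mmse}(\gamma)\,d\gamma - \sigma_X^2\,\mathrm{mmse}(\sigma_X^2/\sigma_N^2). \tag{42}
\]

Thus,

\[
\frac{d}{d\sigma_X^2} \left\{ \int_0^1 \mathrm{Var}(X_t|Y^t)\,dt - \mathrm{Var}(X|Y) \right\}
= -\sigma_X^2 \frac{d}{d\sigma_X^2} \mathrm{mmse}(\sigma_X^2/\sigma_N^2)
= -2\,\mathrm{SNR}\, \frac{d^2}{d\mathrm{SNR}^2} I(\mathrm{SNR})
\ge 0, \tag{43}
\]

where the last inequality is by [18, Corollary 1]. □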

Claim 5 simply states that the monotonicity of $f(\cdot)$ used in inequality (b) of (27) is not
specific to Gaussian input, and holds for any X. Thus, by Claim 5, the term in the braces of
equation (41) is monotonically increasing in the variance of X, which is simply $4\sigma_X^2(p - p^2)$.
Thus, it is maximized by p = 1/2, and we have

\[
f^*(X_{V_n}, \sigma_N^2) = 2\sigma_N^2 I(\mathrm{SNR}) - \sigma_X^2\,\mathrm{mmse}(\mathrm{SNR}), \tag{44}
\]

where I(SNR) and mmse(SNR) are the mutual information and minimal mean square error in
the estimation of X based on $\sqrt{\mathrm{SNR}}X + N$, where X is binary symmetric and N is a standard
normal. Since the conditional mean estimate in this problem is $\tanh(\sqrt{\mathrm{SNR}}\,y)$, we have [18]

\[
I(\mathrm{SNR}) = \mathrm{SNR} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2} \ln\cosh(\mathrm{SNR} - \sqrt{\mathrm{SNR}}\,y)\,dy, \tag{45}
\]

and

\[
\mathrm{mmse}(\mathrm{SNR}) = 1 - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-y^2/2} \tanh(\mathrm{SNR} - \sqrt{\mathrm{SNR}}\,y)\,dy, \tag{46}
\]

so the bound can be computed numerically. The above bound is plotted in Figure 3. Similarly
to the case of Gaussian input, it is insightful to compare this bound to a simple
symbol-by-symbol filtering bound. That is, since for any binary X corrupted by AWGN of variance $\sigma_N^2$,
$0 \le \mathrm{Var}(X|Y) \le \sigma_X^2\,\mathrm{mmse}(\mathrm{SNR})$, we have

\[
\frac{1}{|V_n|} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le \sigma_X^2\,\mathrm{mmse}(\mathrm{SNR}), \tag{47}
\]

where mmse(SNR) is given in (46). This is simply the analogue of (32) to the binary input
setting.
Figure 3: Bounds on the excess loss in scanning and filtering of binary input fields corrupted by
AWGN. The solid line is the bound given in (44) (that is, Theorem 4), and the dashed line is the
symbol-by-symbol bound given in (47).
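The binary-input bound above has no closed form, but it is cheap to evaluate. The following is a rough sketch, assuming the integrals (45) and (46) can be truncated to |y| <= 10 and approximated by the trapezoid rule; the helper names, the truncation width and the step count are illustrative choices rather than anything prescribed by the paper.

```python
import math

def _gauss_expect(func, half_width=10.0, steps=4000):
    """Trapezoid approximation of E[func(Z)] for a standard normal Z."""
    h = 2.0 * half_width / steps
    total = 0.0
    for k in range(steps + 1):
        y = -half_width + k * h
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * math.exp(-0.5 * y * y) * func(y)
    return total * h / math.sqrt(2.0 * math.pi)

def _log_cosh(u):
    """Overflow-safe ln(cosh(u))."""
    a = abs(u)
    return a + math.log1p(math.exp(-2.0 * a)) - math.log(2.0)

def mutual_info_binary(snr):
    """I(SNR) of (45), in nats, for a symmetric binary input."""
    r = math.sqrt(snr)
    return snr - _gauss_expect(lambda y: _log_cosh(snr - r * y))

def mmse_binary(snr):
    """mmse(SNR) of (46) for a symmetric binary input."""
    r = math.sqrt(snr)
    return 1.0 - _gauss_expect(lambda y: math.tanh(snr - r * y))

def binary_scan_bound(snr):
    """Bound (44) divided by sigma_X^2: 2 I(SNR)/SNR - mmse(SNR)."""
    return 2.0 * mutual_info_binary(snr) / snr - mmse_binary(snr)
```

A finite-difference derivative of `mutual_info_binary` should agree with half of `mmse_binary`, in line with (38), which gives a quick consistency check on both integrals.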

A Bound for Arbitrarily Distributed Continuous Input. In this sub-section, we derive
an additional bound on the excess scanning and filtering loss under squared error. We assume,
however, that the input random field $X_{V_n}$ is over $\mathbb{R}^{V_n}$, and that $X_i|Y_{V_n}$ has a finite differential
entropy for any $i \in V_n$ (roughly speaking, this means that in the denoising problem of $X_i$,
$X_i|Y_{V_n}$ is a non-degenerate continuous random variable). Under the above assumptions, we
derive an excess loss bound which is not only valid for non-Gaussian input, but also depends
on the memory in the random field $(X_{V_n}, Y_{V_n})$. On the other hand, it is important to note
that the bound below is mainly asymptotic, and may be much harder to evaluate compared
to the bounds in Theorem 3 or (32).

By [23, Theorem 9.6.5], for any X, Y with a finite conditional differential entropy H(X|Y),

\[
\mathrm{Var}(X|Y) \ge \frac{1}{2\pi e} \exp\{2H(X|Y)\}. \tag{48}
\]
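Thus,

\[
\frac{1}{|V_n|} E L_{(\Psi, F^{opt})}(X_{V_n}, Y_{V_n})
= \frac{1}{|V_n|} \sum_{i=1}^{|V_n|} \mathrm{Var}\left(X_{\Psi_i} \middle| Y_{\Psi_1}^{\Psi_i}\right)
\overset{(a)}{\ge} \frac{1}{|V_n|} \sum_{i=1}^{|V_n|} \frac{1}{2\pi e} \exp\left\{2H\left(X_{\Psi_i} \middle| Y_{\Psi_1}^{\Psi_i}\right)\right\}
\overset{(b)}{\ge} \frac{1}{|V_n|} \sum_{i=1}^{|V_n|} \frac{1}{2\pi e} \exp\left\{2H\left(X_{\Psi_i} \middle| Y_{V_n}\right)\right\}
\overset{(c)}{\ge} \frac{1}{2\pi e} \exp\left\{ \frac{2}{|V_n|} \sum_{i=1}^{|V_n|} H\left(X_{\Psi_i} \middle| Y_{V_n}\right) \right\}, \tag{49}
\]

where (a) is by applying (48) with $Y = Y_{\Psi_1}^{\Psi_i}$, (b) is since conditioning reduces entropy and (c)
is by applying Jensen's inequality.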

The expression $\frac{1}{|V_n|} \sum_{i} H(X_{\Psi_i}|Y_{V_n})$ equals $\frac{1}{|V_n|} \sum_{i} H(X_{\Psi'_i}|Y_{V_n})$ for any two scanners
$\Psi$ and $\Psi'$, since equality holds even without the expectation implicit in the entropy function.
Thus, it is scan-invariant. Define

\[
H^+(X|Y) = \liminf_{n \to \infty} \frac{1}{|V_n|} \sum_{i \in V_n} H(X_i|Y_{V_n}). \tag{50}
\]

$H^+(X|Y)$ can be seen as the asymptotic normalized entropy in the denoising problem of {X}
based on its noisy observations. Note that the entropies in (50) are differential. The following
proposition gives a new upper bound on the excess scanning and filtering loss under squared
error.

Proposition 6. Let $X_{V_n}$ be an arbitrarily distributed continuous valued random field with
$\mathrm{Var}(X_i) = \sigma_X^2$ for all i. Let $Y_i = X_i + N_i$, where $N_{V_n}$ is a white noise of variance $\sigma_N^2$,
independent of $X_{V_n}$. Assume that $X_i|Y_{V_n}$ has a finite differential entropy for any $i \in V_n$.
Then, for any two scans $\Psi_1$ and $\Psi_2$, we have

\[
\liminf_{n \to \infty} \frac{1}{|V_n|} \left| E L_{(\Psi_1, F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi_2, F^{opt})}(X_{V_n}, Y_{V_n}) \right|
\le \frac{\sigma_X^2 \sigma_N^2}{\sigma_X^2 + \sigma_N^2} - \frac{1}{2\pi e} \exp\{2H^+(X|Y)\}. \tag{51}
\]

Proof. The proof follows directly by applying the lower bound on the scanning and filtering
performance given in (49) and the upper bound in (31). □

The bound in Proposition 6 is always at least as tight as the bound in (32) (and thus tighter
than the bound in Theorem 3 for high SNR). For example, if the estimation error of $X_i$ given
$Y_{V_n}$ tends to zero as n increases (as in the case where $X_i = X$ for all i), then $\exp\{2H^+(X|Y)\} = 0$.
However, if $X_i$ cannot be reconstructed completely by $Y_{V_n}$, then the bound may be tighter
than (32). It is far from being a tight bound on the excess loss, though. In the extreme case
where all $X_i$'s are i.i.d., the excess loss bound in Proposition 6 is
$\frac{\sigma_X^2 \sigma_N^2}{\sigma_X^2 + \sigma_N^2} - \mathrm{Var}(X_i|Y_i) > 0$ (for
non-Gaussian X), while it is clear that all reasonable scanner-filter pairs perform the same.
Finally, note that any lower bound on $H^+(X|Y)$ results in an upper bound on the scanning
and filtering excess loss. For example, since

\[
H^+(X|Y) \ge H(X_0 | Y_{-k}^{k}, X_{-k-1}, X_{k+1}) \tag{52}
\]

for any finite k, one can compute a simple upper bound on the excess loss, at least for first
order Markov {X}.

3.2.2 Binary Input and BSC 

Unlike the Gaussian setting discussed in Section 3.2.1, where the bound on the excess loss
resulted from a continuous-time equality, with the mutual information serving as the
scan-invariant feature, in the case of binary input and a BSC the entropy of the random field will
play the key role, similar to [12]. As given in Section 3.1, the best achievable performance (in
the scalar problem) is given by

\[
f_\delta(p) = \min\left\{ \frac{p - \delta}{1 - 2\delta}, \frac{1 - p - \delta}{1 - 2\delta}, \delta \right\}, \tag{53}
\]

where p is the probability that the channel output is 1 and $\delta$ is the channel crossover
probability. Note that $f_\delta(p)$ is not the Bayes envelope associated with estimating $X_t$ using $Y_t$ under
Hamming loss. However, as is clear from the derivations in (17), and will be evident from
the proof of the following theorem, $f_\delta(P(y_t|y^{t-1}))$ is the expectation of the Bayes envelope
(associated with estimating $X_t$ using $Y^t$ under Hamming loss) with respect to the distribution
$P(y_t|y^{t-1})$. Define

\[
\epsilon_\delta = \min_{a,b} \max_{\delta \le p \le 1/2} \left| a h_b(p) + b - f_\delta(p) \right|. \tag{54}
\]

The following is the main result in this sub-section.

Theorem 7. Let $Y_B$ be the output of a BSC with crossover probability $\delta$ whose input is $X_B$.
Then, for any scanner-filter pair $(\Psi, F^{opt})$, where $F^{opt}$ is the optimal filter for the scan $\Psi$, we
have

\[
\frac{1}{|B|} E_{Q_B} L_{(\Psi, F^{opt})}(X_B, Y_B) - U(l_H, Q_B) \le 2\epsilon_\delta. \tag{55}
\]
[Figure 4 plot: "The $2\epsilon_\delta$ bound compared to the simple singlet bound". Horizontal axis: $\delta$.]

Figure 4: Bounds on the excess loss in scanning and filtering of binary random fields corrupted by
a BSC. The solid line is the bound in Theorem 7 ($2\epsilon_\delta$), and the dashed line is the singlet bound
($f_\delta(1/2) = \delta$).

Even without evaluating $\epsilon_\delta$ explicitly, it is easy to see that the excess loss when using
non-optimal scanners is quite small in this binary filtering scenario. For example, for $\delta = 0.1$ and
$\delta = 0.25$ we have $\epsilon_\delta < 0.035$ and $\epsilon_\delta < 0.03$ respectively, yielding a maximal loss of 0.07 or
even 0.06. Figure 4 includes the value of $2\epsilon_\delta$ as a function of $\delta$. Similarly to Section 3.2.1,
it is compared to a simple bound on the excess loss which results from simply bounding the
Hamming loss of any filter by 0 from below and $\delta$ from above (namely, $\delta$ is the resulting
bound on the excess loss). The values in Figure 4 should also be compared to 0.16, which is
the bound on the excess loss in the clean prediction scenario [12], or even to larger values in
the noisy prediction scenario, to be discussed in the next section. The fact that the filtering
problem is less sensitive to the scanning order is quite clear, as the noisy observation of $X_{\Psi_t}$
is available under any scan. Finally, it is not hard to show that in the limits of $\delta \to 0$ and
$\delta \to 1/2$ (high and low SNR, respectively), we have $\epsilon_\delta \to 0$, which is expected, as the scanning
is inconsequential in these cases (note, however, that the singlet bound, $\delta$, does not predict
the correct behavior at low SNR).
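The constant $\epsilon_\delta$ in (54) is a two-parameter minimax fit and can be approximated numerically. The sketch below rests on a few assumptions that are not from the paper: $h_b$ is taken in bits (a change of base only rescales a), the maximum over p is taken on a finite grid, and the search over a is a ternary search on the interval [0, a_max], which is valid because the objective is convex in a. For each a, the best b simply centers the error, so it is not searched explicitly.

```python
import math

def binary_entropy(p):
    """Binary entropy in bits (an assumed convention; the base only rescales a)."""
    if p <= 0.0 or p >= 1.0:
        return 0.0
    return -p * math.log2(p) - (1.0 - p) * math.log2(1.0 - p)

def f_delta(p, delta):
    """The function in (53)."""
    scale = 1.0 - 2.0 * delta
    return min((p - delta) / scale, (1.0 - p - delta) / scale, delta)

def worst_error(a, delta, grid=2000):
    """min over b of max_p |a*h_b(p) + b - f_delta(p)| on a grid of [delta, 1/2].
    The optimal b centers the error, leaving half the spread."""
    gaps = []
    for k in range(grid + 1):
        p = delta + (0.5 - delta) * k / grid
        gaps.append(a * binary_entropy(p) - f_delta(p, delta))
    return 0.5 * (max(gaps) - min(gaps))

def epsilon_delta(delta, a_max=100.0, iters=100):
    """Ternary search over a in [0, a_max]; a_max is an illustrative cap."""
    lo, hi = 0.0, a_max
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if worst_error(m1, delta) <= worst_error(m2, delta):
            hi = m2
        else:
            lo = m1
    return worst_error(0.5 * (lo + hi), delta)
```

Under these assumptions, `epsilon_delta(0.1)` and `epsilon_delta(0.25)` both come out in the vicinity of the values quoted above, and `2 * epsilon_delta(0.1)` is below the singlet bound of 0.1. For crossover probabilities very close to 1/2 the optimal a grows and the cap `a_max` may need to be raised.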

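Proof (Theorem 7). We first show that for any arbitrarily distributed binary n-tuple $X^n$ and
any $0 \le \delta < 1/2$,

\[
\left| a^*_\delta \frac{1}{n} H(Y^n) + b^*_\delta - \frac{1}{n} E L^{opt}(X^n, Y^n) \right| \le \epsilon_\delta, \tag{56}
\]

where $E L^{opt}(X^n, Y^n)$ is the expected cumulative Hamming loss in optimally filtering $X^n$ based
on $Y^n$, and $a^*_\delta$ and $b^*_\delta$ are the minimizers in (54). Indeed,

\[
\frac{1}{n} E L^{opt}(X^n, Y^n)
= \frac{1}{n} \sum_{t=1}^{n} \sum_{y^{t-1}} P(y^{t-1}) \sum_{x_t, y_t} P(x_t, y_t | y^{t-1})\, l(x_t, F_t^{opt}(y^t))
= \frac{1}{n} \sum_{t=1}^{n} \sum_{y^{t-1}} P(y^{t-1}) \sum_{x_t, y_t} P(y_t | y^{t-1}) P(x_t | y^t)\, l(x_t, F_t^{opt}(y^t))
= \frac{1}{n} \sum_{t=1}^{n} \sum_{y^{t-1}} P(y^{t-1}) \sum_{y_t} P(y_t | y^{t-1}) \sum_{x_t} P(x_t | y^t)\, l(x_t, F_t^{opt}(y^t)). \tag{57}
\]

Consider the summation $\sum_{y_t} P(y_t|y^{t-1}) \sum_{x_t} P(x_t|y^t)\, l(x_t, F_t^{opt}(y^t))$. As F is optimal, the
inner sum equals at most $\min\{P(x_t = 0|y^t), P(x_t = 1|y^t)\}$. Thus, similar to the derivations in
(17), we have

\[
\sum_{y_t} P(y_t|y^{t-1}) \sum_{x_t} P(x_t|y^t)\, l(x_t, F_t^{opt}(y^t))
= \sum_{y_t} P(y_t|y^{t-1}) \min\{P(x_t = 0|y^t), P(x_t = 1|y^t)\}
= \min\left\{ \frac{P(y_t = 0|y^{t-1}) - \delta}{1 - 2\delta}, \frac{P(y_t = 1|y^{t-1}) - \delta}{1 - 2\delta}, \delta \right\}. \tag{58}
\]

Let $p = P(y_t = 0|y^{t-1})$. Note that $\min\{p, 1 - p\} \ge \delta$. We have

\[
\left| a^*_\delta \frac{1}{n} H(Y^n) + b^*_\delta - \frac{1}{n} E L^{opt}(X^n, Y^n) \right|
= \left| \frac{1}{n} \sum_{t=1}^{n} \sum_{y^{t-1}} P(y^{t-1}) \left[ a^*_\delta h_b(p) + b^*_\delta - f_\delta(p) \right] \right|
\le \frac{1}{n} \sum_{t=1}^{n} \sum_{y^{t-1}} P(y^{t-1}) \max_{\delta \le p \le 1/2} \left| a^*_\delta h_b(p) + b^*_\delta - f_\delta(p) \right|
= \epsilon_\delta, \tag{59}
\]

which establishes (56). However, the same inequality can be proved for any reordering of the
data $\Psi$ (similar to the proof of [12, Proposition 13]), consequently,

\[
\left| a^*_\delta \frac{1}{|B|} H(\Psi(Y_B)) + b^*_\delta - \frac{1}{|B|} E L_{(\Psi, F^{opt})}(X_B, Y_B) \right| \le \epsilon_\delta. \tag{60}
\]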
Using (60), remembering that $H(Y_B) = H(\Psi(Y_B))$ for any $\Psi$, and applying the triangle
inequality results in (55). □

Note that analogous ideas were used by Verdú and Weissman to bound the absolute
difference between the denoisability and erasure entropy [24].

Theorem 2 gives a lower bound on the best achievable scanning and filtering performance.
Theorems 3, 4 and 7 give an upper bound on the maximal possible difference between the
normalized cumulative loss of any two scanners (accompanied by the optimal filters), or any
one scanner compared to the optimal scan. Although Theorem 2 is similar to the results of
[11], even for the relatively simple examples of a Gaussian field through a Gaussian memoryless
channel or a binary source through a binary symmetric channel we have no results which can
parallel [11, Theorem 17] or [11, Corollary 21], i.e., give an example of an optimal scanner-filter
pair for a certain scenario. However, as the next example shows, we can identify situations
when scanning and filtering improves the filtering results, i.e., non-trivial scanning of the data
results in strictly better restoration. Moreover, the example below illustrates the use of the
results derived in this section.

Example 2 ( One Dimensional Binary Markov Source and the BSC) . In this case, it is not too 
hard to construct a scheme in which non-trivial scanning improves the filtering performance. 
In [21], Ordentlich and Weissman study the optimality of symbol-by-symbol (singlet) filtering
and denoising. That is, the regions (depending on the source and channel parameters) where
a memoryless scheme to estimate $X_i$ is optimal with respect to causal (filtering) or non-causal
(denoising) non-memoryless schemes. Clearly, in the regions where singlet denoising is optimal
(a fortiori singlet filtering), scanning cannot improve the filtering performance. However, 
consider the region where singlet filtering is optimal, yet singlet denoising is not. In this 
region, there exists k for which the estimation error in estimating Xi based on is strictly 

smaller than that based on (as the optimal filter is memoryless yet the optimal denoiser 
is not). Hence, a scanner which in the first pass scans k contiguous symbols, then skips one, 
etc., and in the second pass returns to fill in the holes, accompanied by singlet filtering in the 
first pass and non-memoryless in the second, has strictly better filtering performance than the 
trivial scanner. 

For a binary symmetric Markov source with a transition probability $\pi < 1/2$, corrupted by
a BSC with crossover probability $\delta$, [21, Corollary 3] asserts that singlet filtering
("say-what-you-see" scheme in this case) is optimal if and only if

\[
\delta \le f(\pi) = \frac{1}{2}\left(1 - \sqrt{\max\{1 - 4\pi, 0\}}\right). \tag{61}
\]

Singlet denoising, on the other hand, is optimal if and only if



6 < d(7r) 




Consider a scanner-filter pair which scans the data using an "odds-then-evens" scheme. On the
odds, "say-what-you-see" filtering is used. On the evens, $Y_{i-1}^{i+1}$ are used in order to estimate
$X_i$. The results are in Figures 5 and 6. In Figure 5, the points marked with "x" are where
the "odds-then-evens" scan improves on the trivial scan. The two curves are $f(\pi)$ and $d(\pi)$.
Figure 6 shows the actual improvement made by the "odds-then-evens" scanning and filtering.

For $\delta = \pi = 0.1$, for example, the "odds-then-evens" error rate is smaller than that of
filtering with the trivial scan by 0.021 (that is, 0.079 compared to 0.1). This value should
be put alongside the upper bound on the excess loss given in Theorem 7, which is smaller
than 0.07 in this case. To evaluate the bound on the best achievable scanning and filtering
performance given in Theorem 2 for this example (denoted, with a slight abuse of notation, as
$U(\pi, \delta)$), we have

> C\h{7T*6)), (63) 

where $H(\pi, \delta)$ is the entropy rate of the output, which is in turn lower bounded by $h_b(\pi * \delta)$.
The resulting bound for $\pi = \delta = 0.1$ is approximately 0.04.

Thus, there exist non-trivial scanning and filtering schemes (i.e., lower bounds) whose
improvement on the trivial scanning order is of the same order of magnitude as the upper
bound in Theorem 7. To conclude, it is clear that there is a wide region where a non-trivial
scanning order improves on the trivial scan, and that this region includes at least all the region
between $f(\pi)$ and $d(\pi)$. Yet, it is not clear what the optimal scanner-filter pair is.

3.3 Universal Scanning and Filtering of Noisy Data Arrays 

In [19], Weissman et al. mention that the problems involving sequential decision making on
noisy data are not fundamentally different from their noiseless analogue, and in fact can be
reduced to the noiseless setting using a properly modified loss function. Indeed, this property
of the noisy setting was used throughout the literature, and in this work. The problem of
filtering a noisy data sequence is not different in this sense, and it is possible to construct a
modified loss function such that the filtering problem is transformed into a prediction problem
(with a few important exceptions to be discussed later). Such a modified loss function and a
'filtering-prediction transformation' is discussed in [19]. We briefly review this transformation,
and consider its use in universal filtering of noisy data arrays.

[Figure 5 plot: "Where can non-trivial scanning improve the optimal filtering performance?"]

Figure 5: Where can a simple (suboptimal) "odds-then-evens" scan improve on the trivial scanning
order and optimal filtering scheme. $\pi$ is the transition probability of the symmetric, first order,
Markov source and $\delta$ is the channel crossover probability.

^This is to have few simple steps in the forward-backward algorithm [25, Section 5] which is required to compute
$P(x_i|y_{i-1}^{i+1})$. The generalization to $Y_{i-k}^{i+k}$ is straightforward.

First, we slightly generalize our notion of a filter. For a random variable $U_t$ uniformly
distributed on [0, 1], let $\hat{X}_t(y^{t-1}, y_t, U_t) \in A$ denote the output of the filter $\hat{X}$ at time t, after
observing $y^t$. That is, the filter $\hat{X}$ also views an auxiliary random variable, on which it can base
its output, $\hat{X}_t(y^{t-1}, y_t, U_t)$. We also generalize the prediction space to $M(S)$, $S = \{s : \mathcal{Y} \to A\}$.
I.e., the prediction space is a distribution on the set of functions from the noisy observations
alphabet $\mathcal{Y}$ to the clean signal alphabet A. We assume an invertible discrete memoryless
channel.

For each filter $\hat{X}$, the corresponding predictor is defined by

\[
F_t^{\hat{X}}(y^{t-1})[s] = P\left( \hat{X}_t(y^{t-1}, y, U_t) = s(y) \;\; \forall y \in \mathcal{Y} \right). \tag{64}
\]
[Figure 6 plot: "The difference between optimal filtering and "odds-then-evens" scantering".]

Figure 6: The actual difference between the optimal filtering error rate and the "odds-then-evens"
scanning and filtering error rate. $\pi$ is the transition probability of the symmetric, first order, Markov
source and $\delta$ is the channel crossover probability. Only values for which $\delta < f(\pi)$ are shown.
The analogous 'prediction-filtering transformation' is

\[
\hat{X}_t^{F}(y^t, u_t) = a_j \in A \quad \text{if} \quad
\sum_{i=0}^{j-1} \sum_{s : s(y_t) = a_i} F_t(y^{t-1})[s] \le u_t < \sum_{i=0}^{j} \sum_{s : s(y_t) = a_i} F_t(y^{t-1})[s], \tag{65}
\]

where the subscript i reflects some enumeration of A. Under the above definitions, [19,
Theorem 4] states that for all n, $x^n \in A^n$ and any predictor F,

\[
E L_{\hat{X}^F}(x^n, Y^n) = E L'_F(Y^n), \tag{66}
\]

where $L_{\hat{X}^F}(x^n, Y^n)$ is the cumulative loss of the filter under the original loss function $l$ and
$L'_F(Y^n)$ is the cumulative loss of the predictor under a modified loss function $l'$, which depends
on the original loss $l$ and the channel crossover probabilities.
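The transformation in (65) amounts to inverse-CDF sampling: the predictor's distribution over mappings induces, for the observed $y_t$, a distribution over clean symbols, and the auxiliary uniform variable picks one of them. The following is a minimal sketch of that step. The representation is an assumption made for illustration only: each mapping s is a tuple indexed by the noisy symbol, and the predictor output is a dictionary from mappings to probabilities.

```python
def filter_from_predictor(predictor, y_t, u_t, alphabet):
    """Return the clean-alphabet symbol selected by (65).

    predictor: dict mapping each function s (a tuple, s[y] is the value at
               noisy symbol y) to its probability under F_t(y^{t-1}).
    y_t:       the current noisy observation (an index into each tuple).
    u_t:       a realization of the uniform [0, 1] auxiliary variable.
    alphabet:  the enumeration a_0, a_1, ... of the clean alphabet A.
    """
    cumulative = 0.0
    for symbol in alphabet:
        # Mass of all mappings that send the observed y_t to this symbol.
        cumulative += sum(prob for s, prob in predictor.items() if s[y_t] == symbol)
        if u_t < cumulative:
            return symbol
    return alphabet[-1]  # guards against rounding when u_t is very close to 1

# Hypothetical predictor output over the four mappings {0,1} -> {0,1}.
example_predictor = {(0, 0): 0.2, (0, 1): 0.5, (1, 0): 0.1, (1, 1): 0.2}
```

With `example_predictor` and observation 1, the mappings sending 1 to the symbol 0 carry total mass 0.3, so the filter outputs 0 whenever the uniform variable falls below that mass and 1 otherwise.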

This result can be used for universal filtering, under invertible discrete memoryless channels,
in the following way. For each finite set of filters, construct the corresponding set of predictors,
then use the well known results in universal prediction in order to construct a universal
predictor for that set. Finally, construct the universal filter using the "inverse" prediction-filtering
transformation. Analogously, the results on universal finite set scandiction given in [12] can be
used to construct universal scanner-filter pairs. Note, however, that the modified loss function
$l'$ may be much more complex to handle compared to the original one. For example, it may
not be a function of the difference $x_t - F_t$, even if the original loss function is. Nevertheless,
the results in [12] apply to any bounded loss function, and thus can be utilized.

4 Scandiction of Noisy Data Arrays 

In this section, we consider a scenario similar to that of Section 3, only now, for each t, the datum
$Y_{\Psi_t}$ is not available in the estimation of $X_{\Psi_t}$, namely, $F_t = F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}})$, as opposed to
$F_t = F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_t})$ in the filtering scenario. We refer to this scenario as "noisy scandiction",
analogous to the noisy prediction problems discussed in [TJ] and [15].

We first assume the joint probability distribution of the underlying field and noisy
observations, Q, is known, and examine the settings of Gaussian fields under squared error loss and
binary fields under Hamming loss. In these cases, we characterize the noisy scandictability
and the achieving scandictors in terms of the "clean" scandictability of the noisy data. We
then consider universal scandiction for the noisy setting, and show that this is indeed possible for
finite scandictor sets and for the class of all stationary binary fields corrupted by binary noise.
Finally, we derive bounds on the excess loss when non-optimal scanners are used (yet, with
the optimal predictor for each scan).
4.1 Noisy Scandictability

Throughout this section, it will be beneficial to consider also the clean scandictability as
defined in [11, Definition 2], that is, when the scandictor is judged with respect to the same
random field it observes. Thus, for (X, Y) governed by the probability measure Q, $Q_Y$ denotes
the marginal measure of {Y}, and therefore $U(l, Q_Y)$ refers to the clean scandictability of Y,
i.e.,

\[
L_{(\Psi, F)}(y_B) = \sum_{t=1}^{|B|} l\left(y_{\Psi_t}, F_t(y_{\Psi_1}, \ldots, y_{\Psi_{t-1}})\right) \tag{67}
\]

and

\[
U(l, Q_Y) = \lim_{n \to \infty} \inf_{(\Psi, F)} E_{Q_{B_n}} \frac{1}{|B_n|} L_{(\Psi, F)}(Y_{B_n}). \tag{68}
\]

As mentioned earlier, in this section we relate the noisy scandictability, $U(l, Q)$, to the clean
scandictability of the noisy field, $U(l, Q_Y)$. This relation can be used to derive bounds on
$U(l, Q)$ using the bounds on $U(l, Q_Y)$ derived in [11]. However, this should be done carefully.
For example, the lower and upper bounds given in [11, Theorem 9] are applicable only when
X has an autoregressive representation (with respect to some scandictor) with independent
innovations. Unfortunately, Y = X + N does not necessarily have this representation, and
the bounds do not apply to Y in a straightforward manner. Yet, a simple generalization
of the lower bound in [11], valid for arbitrarily distributed random fields, can be derived
using the same method used in the proof of Theorem 2. To this end, we briefly describe this
generalization.



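Let

\[
B(P) = \min_{\hat{y}} \sum_{y} l(y, \hat{y}) P(y), \tag{69}
\]

and further define

\[
\gamma(d) = \max\{H(P) : B(P) \le d\}. \tag{70}
\]

Similarly as in Section 3.1, denote by $\bar{\gamma}(d)$ the upper concave envelope of $\gamma(d)$.

Corollary 8. For any random field $Y_B$ and any scandictor $(\Psi, F)$ for $Y_B$,

\[
\bar{\gamma}\left( \frac{1}{|B|} E L_{(\Psi, F)}(Y_B) \right) \ge \frac{1}{|B|} H(Y_B). \tag{71}
\]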

^Note that the restriction to autoregressive fields is merely technical, i.e., it facilitates the proof of the lower
bound in the sense that a weak AEP-like theorem is required. The essence of the lower bound, however, which is a
volume preservation argument, is valid for non-autoregressive fields as well.

Proof. The proof is similar to that of Theorem 2. We have,



H{Yb) = H{^{Yb)) 



t=i 

\B\ 



t=l y9t-l 

- (72) 

□ 

The lower bound in Corollary 8 strengthens the bound in [11, Theorem 9] since it applies
to general loss functions, arbitrarily distributed random fields, and is non-asymptotic. When
$A = \mathbb{R}$ and the loss function is of the form $l(x, F) = \rho(x - F)$, where $\rho(z)$ is monotonically
increasing for z > 0, monotonically decreasing for z < 0, satisfies $\rho(0) = 0$ and $\int e^{-s\rho(z)}\,dz < \infty$
for every s > 0, the above bound coincides with that of [11]. In that case, $\bar{\gamma}(d) = \gamma(d)$, which
is in turn the one-sided Fenchel-Legendre transform of the log moment generating function
associated with $\rho$ (see [11, Section III] for the details). For example, when $\rho(z) = z^2$, we have
$\gamma(d) = \frac{1}{2}\ln(2\pi e d)$, d > 0, and $\gamma^{-1}(h) = \frac{1}{2\pi e} e^{2h}$, h > 0. Similar results can be derived for
a binary alphabet; thus, when $\rho(z)$ is the Hamming loss function, $\gamma(d) = h_b(d)$.

We now turn to discuss the noisy scandictability, $U(l, Q)$. The following lemma, proved in
Appendix A.2, describes the noisy scandictability for any additive white noise channel model
and the squared error loss function, $l_s(\cdot)$, in terms of the clean scandictability of Y, and gives
the optimal scandictor.

Lemma 9. Let $\{(X_t, Y_t)\}_{t \in \mathbb{Z}^2}$ be a random field governed by a probability measure Q such that
$Y_t = X_t + N_t$, where $N_t$, $t \in \mathbb{Z}^2$, are i.i.d. random variables with $\mathrm{Var}(N_t) = \sigma_N^2 < \infty$. Then

\[
U(l_s, Q) = U(l_s, Q_Y) - \sigma_N^2. \tag{73}
\]

Furthermore, $U(l_s, Q)$ is achieved by the scandictor which achieves $U(l_s, Q_Y)$.

Actually, Lemma 9 is only scarcely related to scanning. It merely states that in the
prediction of a process based on its noisy observations, under the additive model stated above
and squared error loss, the optimal predictor is one which disregards the noise, and attempts
to predict the next noisy outcome. Similar results for binary processes through a BSC were
given in [16] and will be discussed later.
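The mechanism behind Lemma 9 can be illustrated with a small simulation: for any predictor that depends only on past noisy observations, the squared error measured against the next noisy sample exceeds the squared error measured against the clean sample by the noise variance. The sketch below is illustrative only and makes several assumptions that are not from the paper: a one-dimensional AR(1) source with the trivial scan, zero-mean Gaussian noise, and an arbitrary (non-optimal) linear predictor.

```python
import random

def simulate_losses(n=200000, a=0.8, noise_std=1.0, seed=0):
    """Empirical mean squared errors of one causal predictor, measured
    against the clean sample X_t and against the noisy sample Y_t."""
    rng = random.Random(seed)
    x = 0.0
    y_prev = 0.0
    loss_clean_target = 0.0   # accumulates (X_t - F_t)^2
    loss_noisy_target = 0.0   # accumulates (Y_t - F_t)^2
    for _ in range(n):
        x = a * x + rng.gauss(0.0, 1.0)       # hypothetical AR(1) source
        y = x + rng.gauss(0.0, noise_std)     # zero-mean additive white noise
        f = 0.5 * y_prev                      # arbitrary predictor using the noisy past only
        loss_clean_target += (x - f) ** 2
        loss_noisy_target += (y - f) ** 2
        y_prev = y
    return loss_clean_target / n, loss_noisy_target / n

loss_x, loss_y = simulate_losses()
```

Since the gap between the two losses does not depend on the predictor, minimizing one is the same as minimizing the other, which is the content of the lemma in this simplified setting.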
Finally, we mention that the method used in the proof of Lemma 9 is specific to the squared
error loss function. For a general loss function, one can use conditional expectation in order to
compute the noisy scandictability, under a modified loss function $\rho$. Specifically, for a random
field X, denote by $\sigma(X_{V_n})$ the smallest sigma algebra with respect to which $X_{V_n}$ is measurable.
Let $\Psi$ denote a scanner for $V_n$ and denote by $\mathcal{F}_t^{\Psi}$ the information available to the scandictor
at the t'th step, that is

\[
\mathcal{F}_t^{\Psi} = \sigma\left(Y_{\Psi_1}, \ldots, Y_{\Psi_t}\right). \tag{74}
\]

Note that the set of sites $\Psi_1, \ldots, \Psi_t$ is itself random, yet for each t, $\Psi_t$ is $\mathcal{F}_{t-1}^{\Psi}$ measurable
(if $\Psi$ is random, namely, it uses additional independent random variables, the definition of $\mathcal{F}_t^{\Psi}$
is altered accordingly). Hence, the filtration $\{\mathcal{F}_t^{\Psi}\}_{t=1}^{|V_n|}$ represents the gathered knowledge at
the scandictor. We have

\[
E_{Q_{B_n}} \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \rho(F_t - Y_{\Psi_t})
= E_{Q_{B_n}} \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} E_{Q_{B_n}} \left\{ \rho(F_t - X_{\Psi_t} - N_{\Psi_t}) \middle| \mathcal{F}_{t-1}^{\Psi}, \sigma(X_{\Psi_t}) \right\}
= E_{Q_{B_n}} \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \tilde{\rho}(F_t - X_{\Psi_t}), \tag{75}
\]

for some p. Thus, if Z(X<j,j,F() is the required loss function in the noisy prediction problem 
of {X}, one has to seek a function p{-, •) such that p{xii,^,Ft) = l{x,;/^,Ft) for all x^^ and Ft. 
If such a function is found, then surely Ep{Y^^, Ft) = El{X^^, Ft) and the optimal scandictor 
for the noisy prediction problem is the one which is optimal for the clean prediction problem 
of {Y} under p. While this is simple for the squared error loss function and additive noise 
(choose p{y — F) = {y — F)"^ — cr^), or Hamming loss and BSC (choose p{y,F) = ^^^{ll^] ^ ) 
this is not always the case for a general loss function. It is also important to note that in the 
case of white noise considred in this paper, the condition on the modified loss function p can 
be stated in a single letter expresion, namly, if 1{X,F) is the required loss function for the 
noisy scandiction problem, p should satisfy E{p{Y, F)\a{X)} = 1{X,F). 
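
The single letter condition can be checked exactly for the BSC and Hamming loss. The following is a minimal sketch, not part of the original derivation: the function names and the test values of the crossover probability are illustrative assumptions. It enumerates the two channel outcomes and compares the conditional mean of the modified loss with the Hamming loss of the clean bit.

```python
def hamming(a, b):
    """Hamming loss: 1 if the symbols differ, 0 otherwise."""
    return 0.0 if a == b else 1.0


def rho_bsc(y, f, delta):
    """Modified loss for a BSC with crossover probability delta < 1/2."""
    return (hamming(y, f) - delta) / (1.0 - 2.0 * delta)


def cond_mean_rho(x, f, delta):
    """E{rho(Y, F) | X = x} when Y equals x flipped with probability delta."""
    return (1.0 - delta) * rho_bsc(x, f, delta) + delta * rho_bsc(1 - x, f, delta)


delta = 0.1  # illustrative crossover probability (assumption)
for x in (0, 1):
    for f in (0, 1):
        # The two printed numbers should agree for every pair (x, f).
        print(x, f, cond_mean_rho(x, f, delta), hamming(x, f))
```

The check treats the prediction F as fixed, which matches the white noise case where the noise at the current site is independent of everything the predictor has seen.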

4.1.1 Gaussian Random Fields 

Let both X and N be Gaussian random fields, where the components of N are i.i.d. and 
independent of X. That is, Y is the output of an AWGN channel, with a Gaussian input X. 
In this scenario, similarly to the clean one, the noisy scandictability is known exactly and is 
given by a single letter expression. 

Before we proceed, several definitions are required. For any $t \in \mathbb{Z}^2$ and $V \subseteq \mathbb{Z}^2$, denote 
by $\hat{X}_t(V)$ the best linear predictor of $X_t$ given $\{X_{t'}\}_{t' \in V}$. A subset $S \subseteq \mathbb{Z}^2$ is called a half 
plane if it is closed under addition and satisfies $S \cup (-S) = \mathbb{Z}^2$ and $S \cap (-S) = \{0\}$. For example, 
$S_{lex} = \{(m,n) \in \mathbb{Z}^2 : [m > 0] \text{ or } [m = 0,\ n \ge 0]\}$ is a half plane. Let $X$ be a wide sense 
stationary random field and denote by $g$ the density function associated with the absolutely 
continuous component in the Lebesgue decomposition of its spectral measure. Then, for any 
half plane $S$, we have [26, Theorem 1], 

$$E\left(X_0 - \hat{X}_0(-S \setminus \{0\})\right)^2 = \exp\left\{ \frac{1}{4\pi^2} \int_{[0,2\pi)^2} \ln g(\lambda)\, d\lambda \right\} \triangleq \sigma_u^2(X). \qquad (76)$$

We can now state the following corollary, regarding the noisy scandictability in the Gaussian 
regime and squared error loss, which is a direct application of Lemma 9 and the results of [11, 
Section IV]. 

Corollary 10. Let $\{(X_t, Y_t)\}_{t \in \mathbb{Z}^2}$ be a random field governed by a probability measure $Q$ such 
that $Y_t = X_t + N_t$, where $X$ is a stationary Gaussian random field and $N_t$, $t \in \mathbb{Z}^2$, is an AWGN, 
independent of $\{X_t\}_{t \in \mathbb{Z}^2}$. Then, the noisy scandictability of $Q$ under the squared error loss is 
given by 

$$U(l_s, Q) = \sigma_u^2(Y) - \sigma_N^2. \qquad (77)$$

Furthermore, $U(l_s, Q)$ is asymptotically achieved by a scandictor which scans $\{(X_t, Y_t)\}$ according 
to the total order defined by any half-plane $S$ and applies the corresponding best linear predictor 
for the next outcome of $Y$. 

For any stationary Gaussian process $X$, it has been shown by Kolmogorov (see for example 
[27]) that the entropy rate is given by 

$$\bar{H}(X) = \frac{1}{2} \ln(2\pi e) + \frac{1}{4\pi} \int_{0}^{2\pi} \ln g(\lambda)\, d\lambda. \qquad (78)$$

Thus, using the one-dimensional analogue of (76), for a stationary Gaussian process $X$ we 
have, 

$$\bar{H}(X) = \frac{1}{2} \ln\left(2\pi e\, \sigma_u^2(X)\right). \qquad (79)$$

In fact, (79) applies for stationary Gaussian random fields as well. Thus, we have, 

$$U(l_s, Q) = \sigma_u^2(Y) - \sigma_N^2 = \frac{1}{2\pi e} e^{2\bar{H}(Y)} - \frac{1}{2\pi e} e^{2H(N)}, \qquad (80)$$

where $\bar{H}(Y)$ is the entropy rate of $Y$ and $H(N)$ is the entropy of each $N_t$. From the entropy power 
inequality [23, p. 496], we have, 

$$\frac{1}{2\pi e} e^{2\bar{H}(Y)} \ge \frac{1}{2\pi e} e^{2\bar{H}(X)} + \frac{1}{2\pi e} e^{2H(N)}, \qquad (81)$$

thus, as expected, the noisy scandictability given in Corollary 10 (and (80)) is at least as large 
as the clean scandictability of $X$, that is, with no noise at all. In most of the interesting cases, 
however, (81) is a strict inequality. In fact, as mentioned in [28], (81) is achieved with equality 
only when both $X$ and $N$ are Gaussian and have proportional spectra. Consequently, unless 
$X$ is white, Corollary 10 is non-trivial. 
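
The quantities in (76), (77) and (81) are easy to evaluate numerically once a spectral density is fixed. The sketch below is only an illustration under stated assumptions: it takes a separable AR(1)-type density for the clean field (a hypothetical example, not one used in the paper), normalizes the spectrum so that white noise of variance $\sigma^2$ has $g \equiv \sigma^2$, and approximates the integral in (76) by an average over a uniform grid.

```python
import math


def log_mean_spectrum(g, n=128):
    """exp of the average of ln g over a uniform n-by-n grid on [0, 2*pi)^2.

    This is a rectangle-rule approximation of the right-hand side of (76).
    """
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += math.log(g(2 * math.pi * i / n, 2 * math.pi * j / n))
    return math.exp(total / (n * n))


def ar_spectrum(a, b, var_w):
    """Spectral density of a separable AR(1) field (hypothetical example)."""
    def g(l1, l2):
        d1 = 1 - 2 * a * math.cos(l1) + a * a  # |1 - a e^{-i l1}|^2
        d2 = 1 - 2 * b * math.cos(l2) + b * b  # |1 - b e^{-i l2}|^2
        return var_w / (d1 * d2)
    return g


var_n = 0.5                        # AWGN variance (assumption)
g_x = ar_spectrum(0.6, 0.3, 1.0)   # clean field X, innovation variance 1


def g_y(l1, l2):
    """Spectrum of Y = X + N for white N independent of X."""
    return g_x(l1, l2) + var_n


clean = log_mean_spectrum(g_x)           # sigma_u^2(X), the clean scandictability
noisy = log_mean_spectrum(g_y) - var_n   # sigma_u^2(Y) - sigma_N^2, as in (77)
print(clean, noisy)
```

For this example the clean value equals the innovation variance, and the noisy value is strictly larger because the clean field is not white, in line with the entropy power inequality discussion above.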

4.1.2 Binary Random Fields 

In this case, the results of [13] and [16] shed light on the optimal scandictor. Therein, it was 
shown that for a binary prediction problem, i.e., where $\{X_t\}$ is a binary source passed through 
a BSC with crossover probability $\delta < 1/2$, and $\{Y_t\}$ is the channel output, the more likely 
outcome for the clean bit is also the more likely outcome for the noisy bit. Thus, the optimal 
predictor in the Hamming sense for the next clean bit (based on the noisy observations) might 
as well use the same strategy as if it tries to predict the next noisy bit. Consequently, the 
optimal scandictor in the noisy setting is the one which is optimal for $\{Y\}$, and the results of 
[11, Section V] apply. 

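The following proposition relates the scandictability of a binary noise-corrupted process 
$\{Y\}$, judged with respect to the clean binary process $\{X\}$, to its clean scandictability. 

Proposition 11. Let $\{(X_t, Y_t)\}_{t \in \mathbb{Z}^2}$ be a binary random field governed by a probability measure 
$Q$ such that $\{Y_t\}$ is the output of a binary memoryless symmetric channel with crossover 
probability $\delta$ and input $\{X_t\}$. Then, 

$$U(l_H, Q) = \frac{U(l_H, Q_Y) - \delta}{1 - 2\delta}, \qquad (82)$$

where $l_H$ is the Hamming loss function. Furthermore, $U(l_H, Q)$ is achieved by the scandictor 
which achieves $U(l_H, Q_Y)$. 

Note that indeed $U(l_H, Q_Y) \ge \delta$, as $Y$ is the output of a BSC with crossover probability $\delta$. 

Proof (Proposition 11). Let $\{B_n\}_{n \ge 1}$ be any sequence of elements in $\mathcal{V}$ satisfying $R(B_n) \to \infty$. 
We have 

$$U(l_H, Q_{B_n}) = \inf_{(\Psi,F) \in S(B_n)} \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} P\left( F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \ne X_{\Psi_t} \right), \qquad (83)$$

and, analogously, 

$$U(l_H, Q_{Y,B_n}) = \inf_{(\Psi,F) \in S(B_n)} \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} P\left( F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \ne Y_{\Psi_t} \right). \qquad (84)$$

Denoting by $Z_t$ the channel noise at time $t$, and abbreviating $F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}})$ by $F_t$, we 
have 

$$P(F_t \ne Y_{\Psi_t}) = P(F_t \ne Y_{\Psi_t}, Z_{\Psi_t} = 1) + P(F_t \ne Y_{\Psi_t}, Z_{\Psi_t} = 0)$$
$$= P(F_t = X_{\Psi_t}, Z_{\Psi_t} = 1) + P(F_t \ne X_{\Psi_t}, Z_{\Psi_t} = 0)$$
$$= \left(1 - P(F_t \ne X_{\Psi_t})\right)\delta + P(F_t \ne X_{\Psi_t})(1 - \delta). \qquad (85)$$

Namely, for $\delta < 1/2$, the optimal strategy for predicting $Y_{\Psi_t}$ based on $Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}$ and the 
optimal strategy for predicting $X_{\Psi_t}$ based on $Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}$ are identical, and, in addition, 

$$P(F_t \ne X_{\Psi_t}) = \frac{P(F_t \ne Y_{\Psi_t}) - \delta}{1 - 2\delta}. \qquad (86)$$

Substituting into (83) and taking $n \to \infty$ completes the proof. □ 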
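
The relation between the two error probabilities in (85) can be verified by exact enumeration on a toy example. The sketch below is an illustration only: the two-site clean distribution, the crossover probability and the predictor (repeat the last noisy bit) are all hypothetical choices.

```python
from itertools import product


def error_probs(p_x, delta, predictor):
    """Exact P(F != X_2) and P(F != Y_2) for a two-site field through a BSC.

    p_x maps (x1, x2) to its probability; predictor maps the noisy y1 to a guess F.
    """
    flip = {0: 1.0 - delta, 1: delta}  # P(Z = z) for each channel noise bit
    err_x = err_y = 0.0
    for (x1, x2), z1, z2 in product(p_x, (0, 1), (0, 1)):
        prob = p_x[(x1, x2)] * flip[z1] * flip[z2]
        y1, y2 = x1 ^ z1, x2 ^ z2
        f = predictor(y1)
        err_x += prob * (f != x2)  # error against the clean bit
        err_y += prob * (f != y2)  # error against the noisy bit
    return err_x, err_y


# Hypothetical clean field: two positively correlated bits.
p_x = {(0, 0): 0.4, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.4}
delta = 0.2
err_x, err_y = error_probs(p_x, delta, predictor=lambda y1: y1)
# Inverting (85) recovers the clean error probability from the noisy one.
print(err_x, (err_y - delta) / (1 - 2 * delta))
```

Because the map from the clean error probability to the noisy one is affine and increasing for a crossover probability below one half, any predictor that minimizes one also minimizes the other, which is the point of the proof.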

4.2 Universal Scandiction in the Noisy Scenario 

Section 4.1 dealt with the actual value of the best achievable performance in the noisy scan- 
diction scenario. However, it is also interesting to investigate the universal setting in which 
one seeks a predictor which does not depend on the joint probability measure of $\{(X, Y)\}$, yet 
performs asymptotically as well as one matched to this measure. The problem of universal 
scandiction in the noiseless scenario was dealt with in [12]. Herein, we show that it is possible 
to construct universal scandictors in the noisy setting as well (similar to universal scanning 
and filtering in Section 3.3). First, we show that it is possible to compete successfully with 
any finite set of scandictors, and present a universal scandictor for this setting. We then show 
that with a proper choice of a set of scandictors, it is possible to (universally) achieve $U(l, Q)$, 
i.e., the noisy scandictability, for any spatially stationary random field $(X, Y)$. 

At the basis of the results of [12] stands the exponential weighting algorithm, originally 
derived by Vovk in [29]. In [29], Vovk considered a general set of experts and introduced 
the exponential weighting algorithm in order to compete with the best expert in the set. 
In this algorithm, each expert is assigned a weight according to its past performance. 
By decreasing the weight of poorly performing experts, hence preferring the ones proved to 
perform well thus far, one is able to compete with the best expert, having neither any a priori 
knowledge on the input sequence nor which expert will perform the best. It is clear that the 
essence of this algorithm is the use of the cumulative losses incurred by each expert to construct 
a probability measure on the experts, which is later used to choose an expert for the next action. 
However, when the clean data $X$ is not known to the sequential algorithm, it is impossible 
to calculate the cumulative losses of the experts precisely. Nevertheless, as Weissman and 
Merhav show in [15], using an unbiased estimate $\hat{X}_t(Y_t)$ of $X_t$ results in sufficiently accurate 
estimates of the cumulative losses of the experts, which in turn can be used by the exponential 
weighting algorithm. Hence, the framework derived in [12] can then be used to suggest universal 
scandictors for the noisy setting as well. 

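Consider a random field $(X_B, Y_B)$ where $X$ is binary and $Y$ is either binary (e.g., the output 
of a BSC whose input is $X$) or real valued (e.g., $X$ through a Gaussian noise channel). For a 
loss function $l : \{0,1\} \times [0,1] \to [0,\infty)$ we define, similarly to [15], 

$$l_0(\cdot) = l(0,\cdot) \quad \text{and} \quad l_1(\cdot) = l(1,\cdot). \qquad (87)$$

Assume $(\Psi, F)$ is a scandictor for $B$. Then, for any $t \le |B|$, we have 

$$L_{(\Psi,F)}(x_B, y_B)_t = \sum_{i=1}^{t} l\left(x_{\Psi_i}, F_i(y_{\Psi_1}^{\Psi_{i-1}})\right)$$
$$= \sum_{i=1}^{t} \left[ (1 - x_{\Psi_i})\, l_0\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) + x_{\Psi_i}\, l_1\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) \right]. \qquad (88)$$

Clearly, $L_{(\Psi,F)}(x_B, y_B)_t$ depends on $x_B$ and is not known to the sequential algorithm. Let 
$h(y_{\Psi_i})$ be an unbiased estimate for $x_{\Psi_i}$. For example, when $Y$ is the output of a BSC with 
crossover probability $\delta$ and input $X$ we may choose 

$$h(y_{\Psi_i}) = \frac{y_{\Psi_i} - \delta}{1 - 2\delta}. \qquad (89)$$

Define 

$$\hat{L}_{(\Psi,F)}(y_B)_t = \sum_{i=1}^{t} \left[ \left(1 - h(y_{\Psi_i})\right) l_0\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) + h(y_{\Psi_i})\, l_1\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) \right], \qquad (90)$$

and 

$$\Delta_{(\Psi,F)}(x_B, y_B)_t = L_{(\Psi,F)}(x_B, y_B)_t - \hat{L}_{(\Psi,F)}(y_B)_t$$
$$= \sum_{i=1}^{t} \left(h(y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) + \sum_{i=1}^{t} \left(x_{\Psi_i} - h(y_{\Psi_i})\right) l_1\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right). \qquad (91)$$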

Similarly to [15], we assume that the noise field $N_B$ is of independent components and that 
for each $i \in B$, $Y_i \in \sigma(N_i)$, i.e., the noise component at site $i$ affects the observation at that 
site alone. In Appendix A.3 we show that for any image $x_B$ and any scandictor $(\Psi, F)$ for 
$B$, $\{\Delta_{(\Psi,F)}(x_B, Y_B)_t\}_{t=1}^{|B|}$ is a zero mean martingale. As a result, for any scandictor $(\Psi, F)$, 
image $x_B$ and $t$ we have 

$$E L_{(\Psi,F)}(x_B, Y_B)_t = E \hat{L}_{(\Psi,F)}(Y_B)_t, \qquad (92)$$

namely, $\hat{L}_{(\Psi,F)}(Y_B)_t$ is an unbiased estimator for $L_{(\Psi,F)}(x_B, Y_B)_t$. The universal algorithm for 
scanning and prediction in the noisy scenario will thus use $\hat{L}_{(\Psi,F)}(Y_B)_t$ instead of $L_{(\Psi,F)}(x_B, Y_B)_t$, 
which is unknown. More specifically, similarly to the algorithm proposed in [12], the algorithm 
divides the data array to be scandicted into blocks of size $m(n) \times m(n)$, then scans the data in a 
(fixed) block-wise order, where each block is scandicted using a scandictor chosen at random 
from the scandictors set, according to the distribution 

$$P_i\left(j \,\middle|\, \{\hat{L}_{j,i}\}_{j=1}^{\lambda}\right), \qquad (93)$$

that is, the exponential weighting distribution of [12] evaluated at the estimated losses, 
where $\hat{L}_{j,i} = \sum_{m} \hat{L}_{(\Psi,F)_j}(y^m)$, with the sum taken over the blocks $y^m$ scandicted so far (starting 
from $m = 0$), is the estimated cumulative loss of the scandictor $(\Psi,F)_j$ after 
scandicting $i$ blocks of data, when $(\Psi,F)_j$ is restarted after each block, and $\lambda$ is the cardinality 
of $\mathcal{F}_m$. Note the subscript $m$ in $\mathcal{F}_m$, as in order to scandict a data 
array of size $n \times n$, the universal algorithm discussed herein uses the scandictors with which 
it competes, but only on blocks of size $m \times m$. 
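
The block-wise selection rule can be sketched in a few lines. This is a simplified illustration rather than the algorithm of [12]: the learning rate `eta`, the helper names, and the representation of an expert's output on a block as a flat list of predictions are all assumptions, and the weights use the standard exponential form $e^{-\eta \hat{L}}$ normalized over the experts.

```python
import math
import random


def h_bsc(y, delta):
    """Unbiased estimate of a clean bit from its BSC output (BSC example in the text)."""
    return (y - delta) / (1.0 - 2.0 * delta)


def estimated_block_loss(predictions, noisy_block, delta, l0, l1):
    """Estimated loss of one expert on one block, computed from noisy data only."""
    total = 0.0
    for f, y in zip(predictions, noisy_block):
        hy = h_bsc(y, delta)
        total += (1.0 - hy) * l0(f) + hy * l1(f)
    return total


def expert_distribution(est_losses, eta):
    """Exponential weights over experts given their estimated cumulative losses."""
    shift = min(est_losses)  # subtract the minimum for numerical stability
    weights = [math.exp(-eta * (loss - shift)) for loss in est_losses]
    norm = sum(weights)
    return [w / norm for w in weights]


def pick_expert(est_losses, eta, rng=random):
    """Draw the index of the expert that scandicts the next block."""
    probs = expert_distribution(est_losses, eta)
    return rng.choices(range(len(est_losses)), weights=probs)[0]


# Usage: three hypothetical experts with estimated cumulative losses so far.
print(expert_distribution([3.0, 1.0, 2.0], eta=0.5))
```

Note that the estimated losses may be negative, since the unbiased estimate of a bit can fall outside $[0,1]$; the weights only depend on loss differences, so this causes no difficulty in the selection step.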

The following proposition gives an upper bound on the redundancy of the algorithm when 
competing with a finite set of scandictors, each operating block-wise on the data array. 

Proposition 12. Let $E L_{alg}(x_{V_n}, Y_{V_n})$ be the expected (with respect to the noisy random field 
as well as the randomization in the algorithm) cumulative loss of the proposed algorithm on 
$Y_{V_n}$, when the underlying clean array is $x_{V_n}$ and the noisy field is of independent components 
with $Y_i \in \sigma(N_i)$ for each $i \in V_n \subset \mathbb{Z}^2$. Let $E L_{min}(x_{V_n}, Y_{V_n})$ denote the expected cumulative 
loss of the best scandictor in $\mathcal{F}_m$, operating block-wise on $Y_{V_n}$. Assume $|\mathcal{F}_m| = \lambda$, then 

$$E L_{alg}(x_{V_n}, Y_{V_n}) - E L_{min}(x_{V_n}, Y_{V_n}) \le m(n)\left(n + m(n)\right)\sqrt{\ln \lambda}\, \frac{l_{max}}{\sqrt{2}}. \qquad (94)$$

To be consistent with the notation of [12], the same notation $\mathcal{F}$ is used for both a filtration and 
a scandictor set. The difference should be clear from the context. 

Proof. By (92) and [12, Proposition 3], for any $x_{V_n}$ we have 

$$E L_{alg}(x_{V_n}, Y_{V_n}) - \min_{(\Psi,F) \in \mathcal{F}_m} E L_{(\Psi,F)}(x_{V_n}, Y_{V_n})$$
$$= E \hat{L}_{alg}(Y_{V_n}) - \min_{(\Psi,F) \in \mathcal{F}_m} E \hat{L}_{(\Psi,F)}(Y_{V_n})$$
$$\le E \hat{L}_{alg}(Y_{V_n}) - E \min_{(\Psi,F) \in \mathcal{F}_m} \hat{L}_{(\Psi,F)}(Y_{V_n})$$
$$= E \left\{ \hat{L}_{alg}(Y_{V_n}) - \min_{(\Psi,F) \in \mathcal{F}_m} \hat{L}_{(\Psi,F)}(Y_{V_n}) \right\}$$
$$\le m(n)\left(n + m(n)\right)\sqrt{\ln \lambda}\, \frac{l_{max}}{\sqrt{2}}. \qquad (95)$$

□ 
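
To see how the redundancy behaves per site, one can evaluate the bound for a growing array. The sketch below reads the right-hand side of (94) as $m(n)(n+m(n))\sqrt{\ln\lambda}\, l_{max}/\sqrt{2}$, the form of [12, Proposition 3], which is an assumption about the garbled source; the block-size schedule $m(n) = \lfloor\sqrt{n}\rfloor$ and the number of experts are hypothetical choices.

```python
import math


def redundancy_bound(n, m, num_experts, l_max):
    """Assumed right-hand side of (94) for an n-by-n array scandicted in m-by-m blocks."""
    return m * (n + m) * math.sqrt(math.log(num_experts)) * l_max / math.sqrt(2.0)


def per_site_redundancy(n, m, num_experts, l_max):
    """The same bound normalized by the number of sites, n * n."""
    return redundancy_bound(n, m, num_experts, l_max) / (n * n)


for n in (64, 256, 1024):
    m = math.isqrt(n)  # hypothetical block-size schedule m(n) = floor(sqrt(n))
    print(n, m, per_site_redundancy(n, m, num_experts=16, l_max=1.0))
```

With any schedule satisfying $m(n)/n \to 0$ and a fixed number of experts, the normalized bound vanishes as $n$ grows, which is what makes the block-wise scheme asymptotically competitive.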

Proposition 12 is the basis for the main result in this sub-section, a universal scandictor 
which competes successfully with any finite set of scandictors for the noisy scenario. 

Theorem 13. Let $(X, Y)$ be a stationary random field with a probability measure $Q$. Assume 
that for each $i \in \mathbb{Z}^2$, $Y_i$ is the output of a memoryless channel whose input is $X_i$. Let $\mathcal{F} = 
\{\mathcal{F}_n\}$ be an arbitrary sequence of scandictor sets, where $\mathcal{F}_n$ is a set of scandictors for $V_n$ and 
$|\mathcal{F}_n| = \lambda < \infty$ for all $n$. Then, there exists a sequence of scandictors $\{(\hat{\Psi}, \hat{F})_n\}$, independent 
of $Q$, for which 

$$\liminf_{n \to \infty} E_{Q_{V_n}} E \frac{1}{|V_n|} L_{(\hat{\Psi},\hat{F})_n}(X_{V_n}, Y_{V_n}) \le \liminf_{n \to \infty} \min_{(\Psi,F) \in \mathcal{F}_n} E_{Q_{V_n}} \frac{1}{|V_n|} L_{(\Psi,F)}(X_{V_n}, Y_{V_n}) \qquad (96)$$

for any $Q \in \mathcal{M}_S(\Omega)$, where the inner expectation in the l.h.s. of (96) is due to the possible 
randomization in $(\hat{\Psi}, \hat{F})_n$. 

The proof of Theorem 13 follows the proof of [12, Theorem 2] verbatim. 

It is now possible to show the existence of a universal scandictor for any stationary random 
field in the noisy scandiction setting. Herein, we include only the setting where $X$ is binary 
and $Y$ is the output of a BSC. In this case, the scandictor is twofold-universal, namely, it does 
not depend on the channel crossover probability either. Extending the results to real-valued 
noise is possible using the methods introduced in [16] (although the universal predictor does 
depend on the channel characteristics) and will be discussed later. 

Theorem 14. Let $X$ be a stationary random field over a finite alphabet $A$ with a probability 
measure $Q$. Let $Y$ be the output of a BSC whose input is $X$ and whose crossover probability is 
$\delta$. Let the prediction space $D$ be either finite or bounded (with $l(x,F)$ then being Lipschitz in 
its second argument). Then, there exists a sequence of scandictors $\{(\hat{\Psi}, \hat{F})_n\}$, independent of 
$Q$ and of $\delta$, for which 

$$\lim_{n \to \infty} E_{Q_{V_n}} E \frac{1}{|V_n|} L_{(\hat{\Psi},\hat{F})_n}(X_{V_n}, Y_{V_n}) = U(l, Q) \qquad (97)$$

for any $Q \in \mathcal{M}_S(\Omega)$, where the inner expectation in the l.h.s. of (97) is due to the possible 
randomization in $(\hat{\Psi}, \hat{F})_n$. 

Similar to [16, Section A] and the proof of [12, Theorem 6], in the case of binary input and 
binary-valued noise it is possible to take the set of scandictors with which we compete as the 
set of all possible scandictors for an $m(n) \times m(n)$ block. The proof thus follows directly from 
the proof of [12, Theorem 6]. 

As for continuous-valued observations, it is quite clear that the set of all possible scandictors 
for an $m(n) \times m(n)$ block is far too rich to compete with, since the number 
of predictors is too large. A complete discussion is available in [16, Section B]. However, 
Weissman and Merhav do offer a method for successfully achieving the Bayes envelope for 
this setting, by introducing a much smaller set of predictors, which on one hand includes the 
best $k$-th order Markov predictor, yet on the other hand is not too rich, in the sense that 
the redundancy of the exponential weighting algorithm tends to zero when competing with 
an $\epsilon$-grid of it. Since presenting a universal scandictor for this scenario would mainly be a 
repetition of the many details discussed in [16], we do not include it here. 

4.3 Bounds on the Excess Loss for Non-Optimal Scandictors 

Analogously to the scanning and filtering setting discussed in Section 3 and the clean predic- 
tion setting discussed in [12], it is interesting to investigate the excess loss incurred when non- 
optimal scandictors are used in the noisy scandiction setting. Unlike the scanning and filtering 
setting, where the excess loss bounds were not a straightforward extension of the results in 
[12], in the noisy scandiction scenario this problem can be quite easily tackled using the results 
of [12] and modified loss functions. 

We briefly state the results of [12] in this context. The scenario considered therein is that 
of predicting the next outcome of a binary source, with $D = [0,1]$ as the prediction space. $\phi_\rho$ 
denotes the Bayes envelope associated with the loss function $\rho$, i.e., 

$$\phi_\rho(p) = \min_{q \in [0,1]} \left[ (1-p)\rho(0,q) + p\rho(1,q) \right]. \qquad (98)$$

Similarly to (54), define 

$$\epsilon_\rho = \min_{\alpha,\beta} \max_{0 \le p \le 1} \left| \alpha h_b(p) + \beta - \phi_\rho(p) \right|. \qquad (99)$$

Note that although the definitions of $\phi_\rho(p)$ and $\epsilon_\rho$ refer to the binary scenario, the result 
below holds for larger alphabets, with $\epsilon_\rho$ defined as in (99), with the maximum ranging over 
the simplex of all distributions on the alphabet, and $h(p)$ (replacing $h_b(p)$) and $\phi_\rho(p)$ denoting 
the entropy and Bayes envelope of the distribution $p$, respectively. In [12], it is shown that if 
$X_B$ is an arbitrarily distributed binary random field, then, for any scan $\Psi$, 

$$\left| \alpha_\rho \frac{1}{|B|} H(X_B) + \beta_\rho - E_{Q_B} \frac{1}{|B|} L_{(\Psi,F^{opt})}(X_B) \right| \le \epsilon_\rho, \qquad (100)$$

where $\alpha_\rho$ and $\beta_\rho$ are the achievers of the minimum in (99). 
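
The constant in (99) can be approximated numerically for a given loss. The following sketch does this for the Hamming loss, whose Bayes envelope is $\min(p, 1-p)$. It is an illustration under stated assumptions: the binary entropy is measured in bits, the maximum over $p$ is taken on a finite grid, and the minimization over $\alpha$ uses a ternary search (the worst-case gap is convex in $\alpha$); none of these choices come from the paper.

```python
import math


def binary_entropy(p):
    """Binary entropy in bits, with the convention 0 * log 0 = 0."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)


def bayes_envelope_hamming(p):
    """Bayes envelope for Hamming loss: the best guess errs with probability min(p, 1 - p)."""
    return min(p, 1.0 - p)


def worst_gap(alpha, envelope, grid):
    """For a fixed alpha, the best beta centers the error; return (gap, beta)."""
    vals = [alpha * binary_entropy(p) - envelope(p) for p in grid]
    hi, lo = max(vals), min(vals)
    return (hi - lo) / 2.0, -(hi + lo) / 2.0


def epsilon(envelope, grid_size=2001, lo=0.0, hi=2.0, iters=60):
    """Approximate the min-max in (99) by a ternary search over alpha."""
    grid = [i / (grid_size - 1) for i in range(grid_size)]
    for _ in range(iters):
        a1, a2 = lo + (hi - lo) / 3, hi - (hi - lo) / 3
        if worst_gap(a1, envelope, grid)[0] < worst_gap(a2, envelope, grid)[0]:
            hi = a2
        else:
            lo = a1
    alpha = (lo + hi) / 2
    gap, beta = worst_gap(alpha, envelope, grid)
    return gap, alpha, beta


eps, alpha, beta = epsilon(bayes_envelope_hamming)
print(eps, alpha, beta)
```

Under these assumptions the search returns a value close to 0.08 with $\alpha$ close to $1/2$, which agrees with the constant quoted from [12] for the Hamming loss.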

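As mentioned earlier, if $\rho(Y,F)$ is some loss function for the "clean" prediction problem of 
$\{Y\}$, the noisy process, then, 

$$E\{\rho(Y,F) \mid \sigma(X)\} = \tilde{\rho}(X,F) \qquad (101)$$

for some $\tilde{\rho}$. Assuming a suitable $\rho$ is found (i.e., $\tilde{\rho} = l$), we have, for any scan $\Psi$, 

$$\left| \alpha_\rho \frac{1}{|B|} H(Y_B) + \beta_\rho - \frac{1}{|B|} E_{Q_B} L^{l}_{(\Psi,F^{opt})}(X_B, Y_B) \right| = \left| \alpha_\rho \frac{1}{|B|} H(Y_B) + \beta_\rho - \frac{1}{|B|} E_{Q_B} L^{\rho}_{(\Psi,F^{opt})}(Y_B) \right| \le \epsilon_\rho, \qquad (102)$$

where $\frac{1}{|B|} E_{Q_B} L^{l}_{(\Psi,F^{opt})}(X_B, Y_B)$ is the normalized expected cumulative loss in optimally pre- 
dicting $X_{\Psi_t}$ based on $Y_{\Psi_1}^{\Psi_{t-1}}$, under the loss function $l$, $\frac{1}{|B|} E_{Q_B} L^{\rho}_{(\Psi,F^{opt})}(Y_B)$ is the normalized 
expected cumulative loss in optimally predicting $Y_{\Psi_t}$ based on $Y_{\Psi_1}^{\Psi_{t-1}}$, under the loss function 
$\rho$, and $\alpha_\rho$ and $\beta_\rho$ are the minimizers of $\epsilon_\rho$ as defined in (99). Hence, the following corollary 
applies. 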

Corollary 15. Let $X_B$ be an arbitrarily distributed binary field. Assume a white noise, and 
denote the noisy version of $X_B$ by $Y_B$. Let $D = [0,1]$ be the prediction space and $l : \{0,1\} \times D \to 
\mathbb{R}$ be any loss function. Then, for any scan $\Psi$, 

$$\left| \frac{1}{|B|} E_{Q_B} L^{l}_{(\Psi,F^{opt})}(X_B, Y_B) - U(l, Q_B) \right| \le 2\epsilon_\rho, \qquad (103)$$

when $\rho$ is a loss function such that $E\{\rho(Y,F) \mid \sigma(X)\} = l(X,F)$ for any $F$. 
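Example 3 (BSC and Hamming Loss). In the case of binary input, BSC with crossover prob- 
ability $\delta$ and Hamming loss $l_H(\cdot,\cdot)$, it is not hard to show that 

$$\rho(y,F) = \frac{l_H(y,F) - \delta}{1 - 2\delta}. \qquad (104)$$

Hence, 

$$\phi_\rho(p) = \frac{\phi_{l_H}(p) - \delta}{1 - 2\delta}, \qquad (105)$$

and 

$$\epsilon_\rho = \frac{1}{1 - 2\delta}\, \epsilon_{l_H}, \qquad (106)$$

where $\epsilon_{l_H} = 0.08$ as mentioned in [12]. The above bound on the excess loss can also be 
computed directly, without using Corollary 15, as for any scan $\Psi$, the normalized cumulative 
prediction errors are given by 

$$\frac{1}{|B|} \sum_{t=1}^{|B|} P\left(F_t \ne X_{\Psi_t}\right) \qquad (107)$$

for the noisy scenario, and 

$$\frac{1}{|B|} \sum_{t=1}^{|B|} P\left(F_t \ne Y_{\Psi_t}\right) \qquad (108)$$

for the (clean) prediction of $Y_B$. Hence, using (86), for any scan $\Psi$ we have, 

$$\left| \frac{1}{|B|} E_{Q_B} L^{l_H}_{(\Psi,F^{opt})}(X_B, Y_B) - U(l_H, Q_B) \right| = \left| \frac{\frac{1}{|B|} E_{Q_B} L^{l_H}_{(\Psi,F^{opt})}(Y_B) - \delta}{1 - 2\delta} - \frac{U(l_H, Q_{Y_B}) - \delta}{1 - 2\delta} \right|$$
$$= \frac{1}{1 - 2\delta} \left| \frac{1}{|B|} E_{Q_B} L^{l_H}_{(\Psi,F^{opt})}(Y_B) - U(l_H, Q_{Y_B}) \right| \le \frac{2\epsilon_{l_H}}{1 - 2\delta}. \qquad (109)$$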

Example 4 (Additive Noise and Squared Error). Let $Y_B$ be the output of an additive channel, 
with $\sigma_N^2$ denoting the noise variance. Let $l_s$ be the squared error loss function. In this case, 

$$E\left\{ \left(Y_{\Psi_t} - F_t(Y_{\Psi_1}^{\Psi_{t-1}})\right)^2 - \sigma_N^2 \,\middle|\, \sigma(X_{\Psi_t}) \right\} = \left(X_{\Psi_t} - F_t(Y_{\Psi_1}^{\Psi_{t-1}})\right)^2. \qquad (110)$$

Thus, Corollary 15 applies with $\rho(Y,\hat{Y}) = (Y - \hat{Y})^2 - \sigma_N^2$, and clearly $\epsilon_\rho = \epsilon_{l_s}$. Note that 
although Corollary 15 is stated for a binary alphabet, it is not hard to generalize its result to 
larger alphabets, as mentioned in [12, Section 4]. 
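
The identity behind Example 4 holds for any zero-mean noise, not only Gaussian, and can be checked exactly with a discrete noise distribution. The sketch below is an illustration: the three-point noise distribution and the test points are hypothetical, and the prediction is treated as a fixed number, as it is conditionally on the past observations.

```python
def cond_mean_modified_sq_loss(x, f, noise_pmf):
    """E{(Y - f)^2 - var_N | X = x} for Y = x + N with a discrete, zero-mean noise pmf."""
    mean_n = sum(n * p for n, p in noise_pmf.items())
    assert abs(mean_n) < 1e-12, "the identity assumes zero-mean noise"
    var_n = sum(n * n * p for n, p in noise_pmf.items())
    return sum(p * ((x + n - f) ** 2 - var_n) for n, p in noise_pmf.items())


# Hypothetical three-point noise distribution (zero mean, variance 0.5).
noise_pmf = {-1.0: 0.25, 0.0: 0.5, 1.0: 0.25}
for x, f in [(0.0, 0.3), (1.0, 0.3), (2.5, -1.0)]:
    # The two printed numbers should agree: the modified loss is unbiased for (x - f)^2.
    print(cond_mean_modified_sq_loss(x, f, noise_pmf), (x - f) ** 2)
```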
4.3.1 Excess Loss Bounds Via the Continuous Time Mutual Information 

The bound on the excess noisy scandiction loss given in Corollary 15 was derived using the 
results of [12] and modified loss functions. However, new bounds can also be derived using the 
same method which was used in the proof of Theorem 3, namely, the scan invariance of the 
mutual information and the relation to the continuous time problem. We briefly discuss how 
such a bound can be derived for noisy scandiction of Gaussian fields corrupted by Gaussian 
noise. 

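Using the notation of Section 3.2.1 we have 

$$\mathrm{Var}(X) - \int_0^1 \mathrm{Var}(X_t \mid Y^t)\, dt = \sigma_X^2 - \sigma_N^2 \ln\left(1 + \frac{\sigma_X^2}{\sigma_N^2}\right) = \sigma_N^2\, g\left(\frac{\sigma_X^2}{\sigma_N^2}\right), \qquad (111)$$

where 

$$g(x) = x - \ln(1 + x). \qquad (112)$$

Since $\sigma_X^2 \ge \mathrm{Var}\left(X_{\Psi_t} \mid Y_{\Psi_1}^{\Psi_{t-1}}\right)$ and $g(x)$ is monotonically increasing for $x \ge 0$, derivations 
similar to (27) lead to 

$$\frac{1}{|V_n|} E L_{(\Psi,F^{opt})}(X_{V_n}, Y_{V_n}) \le \sigma_N^2\, g\left(\frac{\sigma_X^2}{\sigma_N^2}\right) + \frac{1}{|V_n|} 2\sigma_N^2 I(X_{V_n}; Y_{V_n}). \qquad (113)$$

On the other hand, since $g(x) \ge 0$ for $x \ge 0$, we have 

$$\frac{1}{|V_n|} E L_{(\Psi,F^{opt})}(X_{V_n}, Y_{V_n}) \ge \frac{1}{|V_n|} 2\sigma_N^2 I(X_{V_n}; Y_{V_n}), \qquad (114)$$

which now can be viewed as the scanning and prediction analogue of [18, eq. (156b)]. We thus 
have the following corollary. 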

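Corollary 16. Let $X_{V_n}$ be a Gaussian random field with a constant marginal distribution 
satisfying $\mathrm{Var}(X_i) = \sigma_X^2 < \infty$ for all $i \in V_n$. Let $Y_i = X_i + N_i$, where $N_{V_n}$ is a white Gaussian 
noise of variance $\sigma_N^2$, independent of $X_{V_n}$. Then, for any two scans $\Psi^1$ and $\Psi^2$ and their 
optimal predictors, we have 

$$\frac{1}{|V_n|} \left| E L_{(\Psi^1,F^{opt})}(X_{V_n}, Y_{V_n}) - E L_{(\Psi^2,F^{opt})}(X_{V_n}, Y_{V_n}) \right| \le \sigma_N^2\, g\left(\frac{\sigma_X^2}{\sigma_N^2}\right). \qquad (115)$$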

As in Theorem 3, the bound in Corollary 16 has the form $\sigma_X^2\, g(\mathrm{SNR})/\mathrm{SNR}$, namely, it 
scales with the variance of the input. As expected, at the limit of low SNR, $g(\mathrm{SNR})/\mathrm{SNR} \to 0$, since 
regardless of the scan, one is clueless about the underlying clean symbol. In fact, it is not 
surprising that this behavior is common to both the filtering and the prediction scenarios. In 
the former, the bound value is given by (23), while in the latter it is given by (111). In both 
cases, the bound value is simply the difference between a continuous time filtering problem, 
and a discrete time filtering (or prediction, in (111)) problem. It is not hard to see that this 
difference tends to $0$ as $\mathrm{SNR} \to 0^+$. At the limit of high SNR, $g(\mathrm{SNR})/\mathrm{SNR} \to 1$. Indeed, this limit 
corresponds to the noiseless scandiction scenario, where scanning is consequential. 
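
The behavior of the bound at the two SNR extremes is easy to tabulate. The sketch below evaluates $g(x) = x - \ln(1+x)$ and the bound $\sigma_N^2\, g(\sigma_X^2/\sigma_N^2)$, and cross-checks it against a numerical integral of the continuous-time causal MMSE. The closed form used for that integrand, $\sigma_X^2\sigma_N^2/(\sigma_N^2 + \sigma_X^2 t)$, is the standard posterior variance of a Gaussian variable observed in white Gaussian noise for a duration $t$; it is assumed here rather than taken from the paper, and the function names are illustrative.

```python
import math


def g(x):
    """g(x) = x - ln(1 + x)."""
    return x - math.log1p(x)


def scan_gap_bound(var_x, var_n):
    """The bound var_n * g(var_x / var_n) on the per-site gap between two scans."""
    return var_n * g(var_x / var_n)


def causal_mmse_integral(var_x, var_n, steps=20000):
    """Midpoint-rule integral over [0, 1] of var_x * var_n / (var_n + var_x * t)."""
    dt = 1.0 / steps
    return sum(var_x * var_n / (var_n + var_x * (k + 0.5) * dt) for k in range(steps)) * dt


var_x = 1.0
for snr in (0.01, 1.0, 100.0):
    var_n = var_x / snr
    # g(snr) / snr is the bound in units of the input variance.
    print(snr, scan_gap_bound(var_x, var_n), g(snr) / snr)
```

The ratio $g(\mathrm{SNR})/\mathrm{SNR}$ moves from near 0 at low SNR to near 1 at high SNR, matching the two limits discussed above.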



5 Conclusion 

We investigated problems in sequential filtering and prediction of noisy multidimensional data 
arrays. A bound on the best achievable scanning and filtering performance was derived, and 
the excess loss incurred when non-optimal scanners are used was quantified. In the prediction 
setting, a relation of the best achievable performance to that of the clean scandictability was 
given. In both the filtering and prediction scenarios, a special emphasis was given to the cases 
of AWGN and squared error loss, and BSC and Hamming loss. 

Due to their sequential nature, the problems discussed in this paper are strongly related 
to the filtering and prediction problems where reordering of the data is not allowed (or where 
there is only one natural order to scan the data), such as robust filtering and universal pre- 
diction discussed in the current literature. However, the numerous scanning possibilities in 
the multidimensional setting add a multitude of new challenges. In fact, many interesting 
problems remain open. It is clear that identifying the optimal scanning methods in the widely 
used input and channel models discussed herein is required, as the implementation of universal 
algorithms might be too complex in realistic situations. Moreover, tighter upper bounds on 
the excess loss can be derived in order to better understand the trade-offs between non-trivial 
scanning methods and the overall performance. Finally, by [11], the trivial scan is optimal 
for scandiction of noise-free Gaussian random fields. By Corollary 10 herein, this is also the 
case in scandiction of Gaussian fields corrupted by Gaussian noise. Whether the same holds 
for scanning and filtering of Gaussian random fields corrupted by Gaussian noise remains 
unanswered. 
A Appendixes 

A.1 Proof of Theorem 4 

The proof resembles the proof of Theorem 3. However, the derivations leading to the analogue 
of (27) are slightly different. For any input field $X_{V_n}$, we have 



i=l 

/ Var(x^jy*;-\{y/^)},g[,_i,_i+i])dt 

Var (x^jy*;-\ dt - Var (x^Jlf; 

/ var(x^jy*;-\{y/^)}ig[,_i,_i+i])dt-r (Xy„,cT^) 



{«) 1 



:ii6) 



where (a) results from the definition of $f^*(\cdot)$. The rest of the proof is similar to the proof 
of Theorem 3, since for any $X_{V_n}$ and $\sigma^2$ it is clear that $f^*(X_{V_n}, \sigma^2)$ is non-negative. 



A.2 Proof of Lemma 9 

Without loss of generality, we assume ENt = 0. From Proposition [U we have, 



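$$U(l_s, Q) = \lim_{n \to \infty} \inf_{(\Psi,F) \in S(B_n)} E \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \left( X_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right)^2. \qquad (117)$$

However, since $Y_{\Psi_t} = X_{\Psi_t} + N_{\Psi_t}$, and $N_{\Psi_t}$ is independent of $N_{\Psi_{t'}}$, $t' \ne t$, and of all $\{X_t\}$, we 
have 

$$E \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \left( X_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right)^2$$
$$= E \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \left[ \left( X_{\Psi_t} + N_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right)^2 - 2N_{\Psi_t}\left( X_{\Psi_t} + N_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right) + N_{\Psi_t}^2 \right]$$
$$= E \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \left( Y_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right)^2 - \sigma_N^2. \qquad (118)$$

That is, 

$$U(l_s, Q) = \lim_{n \to \infty} \inf_{(\Psi,F) \in S(B_n)} E \frac{1}{|B_n|} \sum_{t=1}^{|B_n|} \left( Y_{\Psi_t} - F_t(Y_{\Psi_1}, \ldots, Y_{\Psi_{t-1}}) \right)^2 - \sigma_N^2, \qquad (119)$$

which completes the proof. 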



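A.3 The Martingale Property of $\{\Delta_{(\Psi,F)}(x_B, Y_B)_t\}$ 

The proof follows that of [15, Lemma 1]. However, notice that due to the data-dependent 
scanning, $\mathcal{F}_t^{\Psi}$ is not generated by a fixed set of random variables, that is, over a fixed set of 
sites, but by a set of $t$ random variables which may be different for each instantiation of the 
random field (as for each $t$, $\Psi_t$ depends on $Y_{\Psi_1}^{\Psi_{t-1}}$). Yet, the expectation will always be with 
respect to the random variables seen so far. 
By (91), 

$$\Delta_{(\Psi,F)}(x_B, y_B)_t = \sum_{i=1}^{t} \left(h(y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right) + \sum_{i=1}^{t} \left(x_{\Psi_i} - h(y_{\Psi_i})\right) l_1\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right). \qquad (120)$$

Defining 

$$m_t = \sum_{i=1}^{t} \left(h(y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(y_{\Psi_1}^{\Psi_{i-1}})\right), \qquad (121)$$

we have, 

$$E\{m_{t+1} \mid \mathcal{F}_t^{\Psi}\} = E\left\{ \sum_{i=1}^{t+1} \left(h(Y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(Y_{\Psi_1}^{\Psi_{i-1}})\right) \,\middle|\, \mathcal{F}_t^{\Psi} \right\}$$
$$= E\left\{ \left(h(Y_{\Psi_{t+1}}) - x_{\Psi_{t+1}}\right) l_0\left(F_{t+1}(Y_{\Psi_1}^{\Psi_t})\right) \,\middle|\, \mathcal{F}_t^{\Psi} \right\} + E\left\{ \sum_{i=1}^{t} \left(h(Y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(Y_{\Psi_1}^{\Psi_{i-1}})\right) \,\middle|\, \mathcal{F}_t^{\Psi} \right\}$$
$$= E\left\{ h(Y_{\Psi_{t+1}}) - x_{\Psi_{t+1}} \,\middle|\, \mathcal{F}_t^{\Psi} \right\} l_0\left(F_{t+1}(Y_{\Psi_1}^{\Psi_t})\right) + \sum_{i=1}^{t} \left(h(Y_{\Psi_i}) - x_{\Psi_i}\right) l_0\left(F_i(Y_{\Psi_1}^{\Psi_{i-1}})\right)$$
$$= E\left( h(Y_{\Psi_{t+1}}) - x_{\Psi_{t+1}} \right) l_0\left(F_{t+1}(Y_{\Psi_1}^{\Psi_t})\right) + m_t$$
$$= m_t, \qquad (122)$$

where the third equality is since $Y_{\Psi_{t_0}}$ is $\mathcal{F}_t^{\Psi}$ measurable for any $t_0 \le t$, the fourth is since 
$h(Y_{\Psi_{t+1}}) - x_{\Psi_{t+1}}$ is independent of $\mathcal{F}_t^{\Psi}$ and the fifth is since $h(Y_{\Psi_{t+1}})$ is an unbiased estimate 
for $x_{\Psi_{t+1}}$. Hence, $\{m_t, \mathcal{F}_t^{\Psi}\}$ is a zero-mean martingale (note that $E m_1 = 0$). Analogously, 
$\sum_{i=1}^{t} \left(x_{\Psi_i} - h(Y_{\Psi_i})\right) l_1\left(F_i(Y_{\Psi_1}^{\Psi_{i-1}})\right)$ is also a zero-mean martingale with respect to $\mathcal{F}_t^{\Psi}$, which 
completes the proof. 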
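
The zero-mean property can be confirmed by brute force on a tiny example with a data-dependent scan. The sketch below is an illustration only: the three-site image, the scan rule (start at the center, then go left or right according to the noisy value just seen), the predictor (repeat the last noisy observation) and the squared error loss are hypothetical choices. It averages the true and the estimated cumulative losses over all BSC noise patterns and compares them.

```python
from itertools import product


def h(y, delta):
    """Unbiased estimate of a clean bit from its BSC output: (y - delta) / (1 - 2 delta)."""
    return (y - delta) / (1.0 - 2.0 * delta)


def scan_order(y):
    """Hypothetical data-dependent scan of three sites: start at the center,
    then move left or right depending on the noisy value just observed."""
    return (1, 0, 2) if y[1] == 0 else (1, 2, 0)


def losses(x, y, delta, l0, l1):
    """True and estimated cumulative losses along the scan."""
    true_loss = est_loss = 0.0
    prev = 0.5  # prediction for the first site, before anything is observed
    for site in scan_order(y):
        f = prev  # predict the next site by the last noisy observation
        hy = h(y[site], delta)
        true_loss += (1 - x[site]) * l0(f) + x[site] * l1(f)
        est_loss += (1 - hy) * l0(f) + hy * l1(f)
        prev = y[site]
    return true_loss, est_loss


def expected_losses(x, delta, l0, l1):
    """Average both losses over all BSC noise patterns, weighted by their probability."""
    e_true = e_est = 0.0
    for z in product((0, 1), repeat=len(x)):
        prob = 1.0
        for zi in z:
            prob *= delta if zi else 1.0 - delta
        y = tuple(xi ^ zi for xi, zi in zip(x, z))
        t, e = losses(x, y, delta, l0, l1)
        e_true += prob * t
        e_est += prob * e
    return e_true, e_est


def sq0(f):
    """l_0(F) = l(0, F) for the squared error loss."""
    return f ** 2


def sq1(f):
    """l_1(F) = l(1, F) for the squared error loss."""
    return (1 - f) ** 2


print(expected_losses((1, 0, 1), 0.1, sq0, sq1))
```

The two expectations coincide for every clean image because the noise at the site being predicted is independent of everything the scanner and predictor have used so far, which is exactly the step that makes the martingale argument go through.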